Now we no longer compute the cluster list at the exact moment when an
in-memory inode is created, but rather defer doing that until we
actually need the cluster list. This distinction is important when
traversing directories, as we *are* interested in the inode, but not
really in its contents (though now we do have to resort to using the
file size to approximate the amount of allocated blocks, but this
shouldn't be an issue when working with well-formed filesystems.)
Plain resizes (that increase file size) are now much faster, as the
zeroing is now done in a more efficient way (rather than just looping
over `write_bytes_locked`.) Writes should also be faster, since we no
longer zero out the underlying bytes if we are going to write something
in that very same location anyway.
There's not much use in having this function, as the only remotely
useful thing that it does is that it ensures that the disk cache is
locked before destroying it. However, there shouldn't be any filesystem
activity this late in the unmounting process anyway, so the locking
doesn't seem necessary either.
Make the MemoryType used by mmap() configurable by
File::vmobject_and_memory_type_for_mmap overrides (previously called
vmobject_for_mmap).
All mmap()s were previously MemoryType::Normal.
Immutable mounts are mounts that can't be changed in any aspect, if the
VFSRootContext that hold them is used by a process. This includes two
operations on a mount:
1. Trying to remove the mount from the mount table.
2. Trying to change the flags of the mount.
The condition of a VFSRootContext being held by a process or not is
crucial, as the intention is to allow removal of mounts that marked as
immutable if the VFSRootContext is not being used anymore (for example,
if the container that was created with such context stopped).
Marking mounts as immutable on the first VFS root context essentially
ensures they will never be modified because there will be a process
using that context (which is the "main" VFS root context in the system
runtime).
It should be noted that setting a mount as immutable can be done in
creation time of the mount by passing the MS_IMMUTABLE flag, or by doing
a remount with MS_IMMUTABLE flag.
Previously, the VFS layer would try to handle renames more-or-less by
itself, which really only worked for ext2, and even that was only due to
the replace_child kludge existing specifically for this purpose. This
never worked properly for FATFS, since the VFS layer effectively
depended on filesystems having some kind of reference-counting for
inodes, which is something that simply doesn't exist on any FAT variant
we support.
To resolve various issues with the existing scheme, this commit makes
filesystem implementations themselves responsible for the actual rename
operation, while keeping all the existing validation inside the VFS
layer. The only intended behavior change here is that rename operations
should actually properly work on FATFS.
There were two separate issues at play which made this work incorrectly.
The first was that the newly allocated block was incorrectly computed,
because it was assumed that the cached block list was updated when a new
cluster was allocated. The second issue was that there was an off-by-one
in the loop that collected the newly allocated entries, which meant that
the resulting list had an entry less than what was requested.
The old overload still depended on m_entry being initialized, which
meant that (due to where this method is used) all inodes ended up
getting the same index.
128-byte inodes are more-or-less deprecated since they store timestamps
as unsigned 32-bit integer offsets from the UNIX epoch. Large inodes, on
the other hand, contain records that may specify a different epoch, and
are thus less susceptible to overflows. These very same records also
have the capability to store timestamps with nanosecond precision, which
this commit adds full support for as well.
This replaces all usages of Cacheable::Yes with MemoryType::Normal and
Cacheable::No with either MemoryType::NonCacheable or MemoryType::IO,
depending on the context.
The Page{Directory,Table}::set_cache_disabled function therefore also
has been replaced with a more appropriate set_memory_type_function.
Adding a memory_type "getter" would not be as easy, as some
architectures may not support all memory types, so getting the memory
type again may be a lossy conversion. The is_cache_disabled function
was never used, so just simply remove it altogether.
There is no difference between MemoryType::NonCacheable and
MemoryType::IO on x86 for now.
Other architectures currently don't respect the MemoryType at all.
For some context, write_bytes_locked used to simply bail out before
writing any data if there weren't enough blocks to cover the entire size
of an inode before 1bf7f99a.
We're not actually restoring that behavior here, since computing the
amount of blocks to be allocated would get exceedingly complex,
considering that there could always be holes in between already
allocated blocks.
Instead, this simply makes allocate_blocks() bail out properly if there
aren't any free blocks left.
Fixes#24980
This commit reorganizes the BootInfo struct definition so it can be
shared for all architectures.
The existing free extern "C" boot info variables have been removed and
replaced with a global BootInfo struct, 'g_boot_info'.
On x86-64, the BootInfo is directly copied from the Prekernel-provided
struct.
On AArch64 and RISC-V, BootInfo is populated during pre_init.
This change has many improvements:
- We don't use `LockRefPtr` to hold instances of many base devices as
with the DeviceManagement class. Instead, we have a saner pattern of
holding them in a `NonnullRefPtr<T> const`, in a small-text footprint
class definition in the `Device.cpp` file.
- The awkwardness of using `::the()` each time we need to get references
to mostly-static objects (like the Event queue) in runtime is now gone
in the migration to using the `Device` class.
- Acquiring a device feel more obvious because we use now the Device
class for this method. The method name is improved as well.
Commit ad73adef5d needlessly introduced an IPv4 directory.
If we were to keep it, sharing common headers between IPv4 and IPv6
would be much messier, and may potentially increase code duplication.
This change renames the IPv4 directory to IP to aid with later IPv6
porting efforts.
In Ext2FSInode::compute_block_list_impl(), each call to
process_block_array() creates a new ByteBuffer, which leads to a
kmalloc() call. The ByteBuffer is then discarded when
process_block_array() exits, leading to a kfree() call.
This leads to repeated kmalloc() and kfree() calls as ByteBuffers are
created and destroyed each time process_block_array() is called.
This commit makes it so that only 1 ByteBuffer is created for each level
of inode indirect block (so only 3 ByteBuffers are created at most).
These ByteBuffers are re-used on each call to process_block_array().
This reduces the number of kmalloc() and kfree() calls during
compute_block_list_impl(), especially for larger files.
Instead of using a raw `KBuffer` and letting each implementation to
populating the specific flags on its own, we change things so we only
let each FileSystem implementation to validate the flag and its value
but then store it in a HashMap which its key is the flag name and
the value is a special new class called `FileSystemSpecificOption`
which wraps around `AK::Variant<...>`.
This approach has multiple advantages over the previous:
- It allows runtime inspection of what the user has set on a `MountFile`
description for a specific filesystem.
- It ensures accidental overriding of filesystem specific option that
was already set is not possible
- It removes ugly casting of a `KBuffer` contents to a strongly-typed
values. Instead, a strongly-typed `AK::Variant` is used which ensures
we always get a value without doing any casting.
Please note that we have removed support for ASCII string-oriented flags
as there were no actual use cases, and supporting such type would make
`FileSystemSpecificOption` more complicated unnecessarily for now.
The superblock of an ext2 filesystem is always found on the storage
device at offset 1024. This 1024 number was hardcoded in the Ext2FS
code.
This commit:
* adds a constexpr to replace the hardcoded 1024 values
* removes a comment about one of the the hardcoded 1024 values which is
now umnecessary
There's no point in constructing an object just for the sake of keeping
a state that can be touched by anything in the kernel code.
Let's reduce everything to be in a C++ namespace called with the
previous name "VirtualFileSystem" and keep a smaller textual-footprint
struct called "VirtualFileSystemDetails".
This change also cleans up old "friend class" statements that were no
longer needed, and move methods from the VirtualFileSystem code to more
appropriate places as well.
Please note that the method of locking all filesystems during shutdown
is removed, as in that place there's no meaning to actually locking all
filesystems because of running in kernel mode entirely.
This new syscall will be used by the upcoming runc (run-container)
utility.
In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should have attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.
The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) to be responsible for this type of isolation.
Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
simplify the idea of being jailed to a simple ProtectedValues boolean
flag. This means that we no longer check of matching jail pointers
anywhere in the Kernel code.
To set a process as jailed, a new prctl option was added to set a
Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
A process can either iterate on the global process list, or if it's
attached to a scoped process list, then only over that list.
This essentially replaces the need of checking the Jail pointer of a
process when iterating over process lists.
Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.
For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.
The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.
To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.
We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
After the previous commit, we are able to create a comprehensive list of
all devices' major number allocations.
To help userspace to distinguish between character and block devices, we
expose 2 sysfs nodes so userspace can decide which list it needs to open
in order to iterate on it.
Instead of putting everything in one hash map, let's distinguish between
the devices based on their type.
This change makes the devices semantically separated, and is considered
a preparation before we could expose a comprehensive list of allocations
per major numbers and their purpose.
With this change, we no longer preallocate blocks when an inode's size
is updated, and instead only allocate the minimum amount of blocks when
the inode is actually written to.