This essentially reverts 5ada38f9c3.
Previously, two threads could end up trying to allocate a committed
page at once, possibly resulting in a panic because we tried to
allocate more pages than committed.
Another problem was that a thread could incorrectly think that the page
fault was already handled. This can happen if the thread handling the
page fault already set the physical page slot to the newly allocated
page, but didn't remap the page yet. We check if a page fault was
already processed based on the physical page slot contents.
This issue is not causing problems currently, since thinking a page
fault was already handled and incorrectly returning will still work
eventually when the other thread is done remapping the page.
However, a future commit will add extra assertions checking that page
faults were already handled appropriately if we couldn't find a reason
for the fault. These assertions would trip on this.
Prevent these issues by taking the lock for a longer amount of time.
There might be a better solution to this, but that would likely require
more complex code changes.
Also modify the code in handle_fault() a bit to avoid using should_cow()
for zero faults. The checks in should_cow() can refer to a different
physical page if the page fault was handled immediately after the check.
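A minimal sketch of the widened lock scope, using illustrative names
(SpinlockLocker, physical_page_slot, remap_page) rather than the exact
kernel API:

    // Sketch only: names are illustrative, not the real SerenityOS API.
    // The VMObject lock is now held across the slot check, the page
    // allocation, and the remap, so a second thread can never observe
    // the slot set while the page is not yet mapped.
    PageFaultResponse handle_zero_fault_locked(Region& region, size_t page_index)
    {
        SpinlockLocker locker(region.vmobject().lock());

        auto& page_slot = region.physical_page_slot(page_index);
        if (!page_slot.is_null())
            return PageFaultResponse::Continue; // Fully handled by another thread.

        page_slot = allocate_committed_physical_page();
        region.remap_page(page_index); // Still under the lock.
        return PageFaultResponse::Continue;
    }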
Instead of having to maintain two different page fault handler
implementations, let's unify the two by using the more generic RISC-V
implementation.
The RISC-V implementation doesn't depend on the processor providing the
reason why a page fault occurred.
We don't need to know whether it's a NotPresent or ProtectionViolation
fault to handle it correctly, as we already have enough metadata.
This causes us to print a more useful error message than "Unexpected
page fault".
Additionally, this change will be necessary in a future commit, which
expects us to handle all reasons for page faults exhaustively.
This matches the x86-64 and AArch64 behavior.
This required moving the is_instruction_fetch() check before the
is_read() check, since is_read() is now also true for instruction fetch
page faults.
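A sketch of the resulting check order, with illustrative names; the
point is that the instruction fetch check must run before the read
check now that is_read() also covers fetches:

    // Sketch only: illustrative names. An instruction fetch fault also
    // reports as a read, so classify fetches first.
    if (fault.is_instruction_fetch() && !region.is_executable())
        return PageFaultResponse::ShouldCrash;
    if (fault.is_read() && !region.is_readable())
        return PageFaultResponse::ShouldCrash;
    if (fault.is_write() && !region.is_writable())
        return PageFaultResponse::ShouldCrash;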
... and always use it for inode write faults.
In the RISC-V page fault implementation, we previously only used this
function if `should_dirty_on_write()` returned true. If it didn't, we'd
use `handle_inode_fault()`. But this shouldn't be necessary, since
`handle_dirty_on_write_fault()` already handles cases where the page
isn't mapped yet.
This should now also cause a page to be immediately marked as dirty
if the first access to it was a write. Previously, this would have
caused two page faults: one for loading the inode page and one for
marking it dirty.
This code in the x86-64 and AArch64 page fault handlers should be
unreachable: Region::map_individual_page_impl() always maps all
non-null physical pages, so we should never get a PageNotPresent
fault if the page slot is set to a non-null value.
Lazy committed pages are implemented by mapping them read-only, so a
write access to them will result in a ProtectionViolation fault, which
will call very similar code in Region::handle_zero_fault().
Similarly, in the RISC-V version, we already handle lazy committed page
faults by calling Region::handle_zero_fault() a couple lines earlier
if the page fault was generated by a write access and the region is
writable.
The stricter W^X protection introduced by af3d3c5c4a was accidentally
broken by 5194ab59b5, since it didn't set the shadow permission bits
to the initial Region permissions.
The shift operations were originally introduced in af3d3c5 to
record the permission bits set for a Region, but they have since been
replaced by the method introduced in 5194ab59b, so the shift
operations can be removed.
Clang compiles that builtin to an abort for freestanding environments
because RISC-V does not have an instruction to flush the instruction
cache for all harts.
We don't support SMP on RISC-V currently, so simply use a `fence.i` for
now.
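The single-hart flush boils down to one instruction; a minimal sketch:

    // fence.i synchronizes the instruction and data streams on the
    // current hart only. Once SMP is supported, remote harts would
    // need an SBI remote fence or an IPI instead.
    static inline void flush_instruction_cache()
    {
        asm volatile("fence.i" ::: "memory");
    }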
This replaces all usages of Cacheable::Yes with MemoryType::Normal and
Cacheable::No with either MemoryType::NonCacheable or MemoryType::IO,
depending on the context.
The Page{Directory,Table}::set_cache_disabled function has therefore
also been replaced with a more appropriate set_memory_type function.
Adding a memory_type "getter" would not be as easy, as some
architectures may not support all memory types, so getting the memory
type again may be a lossy conversion. The is_cache_disabled function
was never used, so simply remove it altogether.
There is no difference between MemoryType::NonCacheable and
MemoryType::IO on x86 for now.
Other architectures currently don't respect the MemoryType at all.
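A self-contained sketch of the mapping described above;
PTE_CACHE_DISABLE and arch_pte_bits_for() are illustrative stand-ins
for the per-architecture page table code:

    #include <cstdint>

    enum class MemoryType {
        Normal,       // Cacheable RAM (previously Cacheable::Yes).
        NonCacheable, // Uncached RAM, e.g. DMA buffers (previously Cacheable::No).
        IO,           // Device MMIO (previously Cacheable::No).
    };

    // Illustrative: on x86, the PCD bit (bit 4) disables caching, so
    // NonCacheable and IO currently end up with the same PTE bits.
    constexpr uint64_t PTE_CACHE_DISABLE = 1ull << 4;

    constexpr uint64_t arch_pte_bits_for(MemoryType type)
    {
        switch (type) {
        case MemoryType::Normal:
            return 0;
        case MemoryType::NonCacheable:
        case MemoryType::IO:
            return PTE_CACHE_DISABLE;
        }
        return 0;
    }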
Writes to SharedInodeVMObjects could cause a Protection Violation if a
page was marked as dirty by a different process.
This happened due to a combination of two things:
* handle_dirty_on_write_fault() was skipped if a page was already marked
as dirty
* when a page was marked as dirty, only the Region that caused the page
fault was remapped
This commit:
* fixes the crash by making handle_fault() stop checking if a page was
marked dirty before running handle_dirty_on_write_fault()
* modifies handle_dirty_on_write_fault() so that it always marks the
page as dirty and remaps the page (this avoids a 2nd bug that was
never hit due to the 1st bug)
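A sketch of the fixed handler, with illustrative names; the two fixes
above correspond to the unconditional dirty-marking and the
all-regions remap:

    // Sketch only: illustrative names. Mark dirty and remap
    // unconditionally; even if the page is already dirty, another
    // process's PTE may still map it read-only and must be updated.
    PageFaultResponse Region::handle_dirty_on_write_fault(size_t page_index)
    {
        auto& inode_vmobject = static_cast<InodeVMObject&>(vmobject());
        inode_vmobject.set_page_dirty(page_index, true);
        inode_vmobject.for_each_region([&](Region& region) {
            region.remap_vmobject_page(page_index);
        });
        return PageFaultResponse::Continue;
    }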
This commit introduces VMObject::remap_regions_single_page(). This
method remaps a single page in all regions associated with a VMObject.
This is intended to be a more efficient replacement for remap_regions()
in cases where only a single page needs to be remapped.
This commit also updates the cow page fault handling code to use this
new method.
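The new method presumably has roughly this shape (the name follows the
commit; the body and the contains_vmobject_page() helper are a hedged
guess, not the actual implementation):

    // Remap just one page in every region that maps this VMObject,
    // instead of remapping each region wholesale via remap_regions().
    void VMObject::remap_regions_single_page(size_t page_index)
    {
        for_each_region([&](Region& region) {
            if (region.contains_vmobject_page(page_index))
                region.remap_vmobject_page(page_index);
        });
    }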
Writes to a MAP_SHARED | MAP_ANONYMOUS mmap region were not visible to
other processes sharing the mmap region. This was happening because the
page fault handler was not remapping the VMObject's m_regions after
allocating a new page.
This commit fixes the problem by calling remap_regions() after assigning
a new page to the VMObject in the page fault handler. This remapping
only occurs for shared Regions.
This commit makes the following minor changes to handle_zero_fault():
* cleans up a call to static_cast(), replacing it with a reference (a
future commit will also use this reference).
* replaces a call to vmobject() with the new reference mentioned above.
* moves the definition of already_handled to inside the block where
already_handled is used.
AddressSpace::try_allocate_split_region() was updating the cow map of
new_region based on the cow map of source_region.
The problem is that both new_region and source_region reference the
same vmobject and the same cow map, so these cow map updates didn't
actually change anything.
This commit:
* removes the cow map updates from try_allocate_split_region()
* removes Region::set_should_cow() since it is no longer used
InodeVMObjects now track dirty and clean pages. This tracking is used
by the msync and purge syscalls.
Dirty page tracking works using the following rules:
* when a new InodeVMObject is made, all pages are marked clean.
* writes to clean InodeVMObject pages will cause a page fault,
the fault handler will mark the page as dirty.
* writes to dirty InodeVMObject pages do not cause page faults.
* if msync is called, only dirty pages are flushed to storage (and
marked clean).
* if purge syscall is called, only clean pages are discarded.
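A self-contained model of these rules, using std::vector in place of
the kernel's Bitmap; all names are illustrative:

    #include <cstddef>
    #include <vector>

    struct InodeVMObjectModel {
        // One flag per page; a freshly created InodeVMObject is all clean.
        std::vector<bool> dirty;

        explicit InodeVMObjectModel(std::size_t page_count)
            : dirty(page_count, false)
        {
        }

        // Page fault handler: the first write to a clean page marks it
        // dirty (and remaps it writable, so further writes don't fault).
        void on_write_fault(std::size_t page_index) { dirty[page_index] = true; }

        // msync: flush only dirty pages, then mark them clean.
        template<typename FlushFn>
        void msync(FlushFn flush)
        {
            for (std::size_t i = 0; i < dirty.size(); ++i) {
                if (dirty[i]) {
                    flush(i);
                    dirty[i] = false;
                }
            }
        }

        // purge: discard only clean pages; dirty pages hold unflushed data.
        template<typename DiscardFn>
        void purge(DiscardFn discard)
        {
            for (std::size_t i = 0; i < dirty.size(); ++i) {
                if (!dirty[i])
                    discard(i);
            }
        }
    };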
As MMIO is placed at fixed physical addresses and does not need to be
backed by real RAM physical pages, there's no need to use PhysicalPage
instances to track its pages.
This results in slightly fewer allocations, but more importantly makes
MMIO addresses that end up above the normal RAM ranges work, as 64-bit
PCI BARs usually do.
Instead, rewrite the region page fault handling code to not use
PageFault::type() on RISC-V.
I split Region::handle_fault() into a separate RISC-V-specific
implementation, as I am not sure we cover all page fault handling edge
cases by relying solely on MM's own region metadata.
We should probably also take the processor-provided page fault reason
into account, if we decide to merge these two implementations in the
future.
This moves the KString, KBuffer, DoubleBuffer, KBufferBuilder, IOWindow,
UserOrKernelBuffer and ScopedCritical classes to the Kernel/Library
subdirectory.
Also, move the panic and assertions handling code to that directory.
Previously we had a race condition in the page fault handling: We were
relying on the affected Region staying alive while handling the page
fault, but this was not actually guaranteed, as an munmap from another
thread could result in the region being removed concurrently.
This commit closes that hole by extending the lifetime of the region
affected by the page fault until the handling of the page fault is
complete. This is achieved by maintaining a pseudo-reference count on
the region which counts the number of in-progress page faults being
handled on this region, and extending the lifetime of the region while
this counter is non-zero.
Since both the increment of the counter by the page fault handler and
the spin loop waiting for it to reach 0 during Region destruction are
serialized using the appropriate AddressSpace spinlock, eventual
progress is guaranteed: As soon as the region is removed from the tree
no more page faults on the region can start.
And similarly correctness is ensured: The counter is incremented under
the same lock, so any page faults that are being handled will have
already incremented the counter before the region is deallocated.
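A self-contained model of the scheme, using std::mutex and std::atomic
in place of the kernel's spinlocks; names are illustrative:

    #include <atomic>
    #include <mutex>

    std::mutex address_space_lock; // Stands in for the AddressSpace spinlock.

    struct RegionModel {
        std::atomic<unsigned> in_progress_faults { 0 };
    };

    void begin_page_fault(RegionModel& region)
    {
        // Incremented under the same lock that serializes region removal,
        // so a region found in the tree cannot be freed before it is counted.
        std::lock_guard guard(address_space_lock);
        region.in_progress_faults.fetch_add(1, std::memory_order_acquire);
    }

    void end_page_fault(RegionModel& region)
    {
        region.in_progress_faults.fetch_sub(1, std::memory_order_release);
    }

    void remove_region(RegionModel& region)
    {
        {
            std::lock_guard guard(address_space_lock);
            // ... remove the region from the address space tree here;
            // afterwards, no new page fault can find it and start counting.
        }
        // Spin until all in-progress page faults have drained; only then
        // is it safe to deallocate the region.
        while (region.in_progress_faults.load(std::memory_order_acquire) != 0) {
            // The kernel would relax/yield the CPU here.
        }
    }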
The handling of page tables is very architecture-specific, so it belongs
in the Arch directory. Some parts were already architecture-specific;
however, this commit moves the rest of the PageDirectory class into the
Arch directory.
While we're here, the aarch64/PageDirectory.{h,cpp} files are updated to
be aarch64 specific, by renaming some members and removing x86_64
specific code.
These instances were detected by searching for files that include
AK/Memory.h, but don't match the regex:
\b(fast_u32_copy|fast_u32_fill|secure_zero|timing_safe_compare)\b
This regex is pessimistic, so there might be more files that don't
actually use any memory function.
In theory, one might use LibCpp to detect things like this
automatically, but let's do this one step at a time.
This step would ideally not have been necessary (it increases the
amount of refactoring and templating necessary, which in turn increases
build times), but it gives us a couple of nice properties:
- SpinlockProtected inside Singleton (a very common combination) can now
obtain any lock rank just via the template parameter. It was not
previously possible to do this with SingletonInstanceCreator magic.
- SpinlockProtected's lock rank is now mandatory; this is the majority
of cases and allows us to see where we're still missing proper ranks.
- The type already informs us what lock rank a lock has, which aids code
readability and (possibly, if gdb cooperates) lock mismatch debugging.
- The rank of a lock can no longer be dynamic, which is not something we
wanted in the first place (or made use of). Locks randomly changing
their rank sounds like a disaster waiting to happen.
- In some places, we might be able to statically check that locks are
taken in the right order (with the right lock rank checking
implementation) as rank information is fully statically known.
This refactoring further exposes the fact that Mutex has no lock rank
capabilities, which is not fixed here.
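A sketch of the resulting shape, with simplified stand-ins for the
kernel's Spinlock and LockRank; illustrative, not the exact
declarations:

    enum class LockRank { None, MemoryManager, Process, Thread };

    template<LockRank Rank>
    class Spinlock {
        // Locking elided in this sketch.
    };

    // The rank is part of the type: mandatory, statically known, and
    // impossible to change at runtime.
    template<typename T, LockRank Rank>
    class SpinlockProtected {
    public:
        template<typename Callback>
        decltype(auto) with(Callback callback)
        {
            // Would acquire m_lock for the duration of the callback.
            return callback(m_value);
        }

    private:
        T m_value {};
        Spinlock<Rank> m_lock {};
    };

    // A Singleton can now pass the rank purely through template
    // parameters, e.g.:
    //   Singleton<SpinlockProtected<GlobalData, LockRank::None>> s_the;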
According to Dr. POSIX, we should allow mmap to be called on inodes
even on ranges that currently don't map to any actual data. Trying to
read or write to those ranges should result in SIGBUS being sent to the
thread that performed the violating memory access.
To implement this restriction, we simply check if the result of
read_bytes on an Inode returns 0, which means we have nothing valid to
map to the program, hence it should receive a SIGBUS in that case.
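A sketch of the check, assuming illustrative names for the fault
handler's types (PageFaultResponse::BusError is presumed to be
translated to SIGBUS by the caller):

    // Sketch only: a zero-byte read means the page lies entirely past
    // the inode's data, so there is nothing valid to map.
    auto nread_or_error = inode.read_bytes(offset_in_inode, PAGE_SIZE, buffer, nullptr);
    if (nread_or_error.is_error())
        return PageFaultResponse::ShouldCrash;
    if (nread_or_error.value() == 0)
        return PageFaultResponse::BusError; // Deliver SIGBUS to the faulting thread.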
Globally shared MemoryManager state is now kept in a GlobalData struct
and wrapped in SpinlockProtected.
A small set of members are left outside the GlobalData struct as they
are only set during boot initialization, and then remain constant.
This allows us to access those members without taking any locks.
I believe this to be safe, as the main thing that LockRefPtr provides
over RefPtr is safe copying from a shared LockRefPtr instance. I've
inspected the uses of RefPtr<PhysicalPage> and it seems they're all
guarded by external locking. Some of it is less obvious, but this is
an area where we're making continuous headway.
This allows sys$mprotect() to honor the original readable & writable
flags of the open file description as they were at the point we did the
original sys$mmap().
IIUC, this is what Dr. POSIX wants us to do:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mprotect.html
Also, remove the bogus and racy "W^X" checking we did against mappings
based on their current inode metadata. If we want to do this, we can do
it properly. For now, it was not only racy, but also did blocking I/O
while holding a spinlock.
We were holding the MM lock across all of the region unmapping code.
This was previously necessary since the quickmaps used during unmapping
required holding the MM lock.
Now that it's no longer necessary, we can leave the MM lock alone here.
You're still required to disable interrupts though, as the mappings are
per-CPU. This exposed the fact that our CR3 lookup map is insufficiently
protected (but we'll address that in a separate commit.)
Until now, our kernel has reimplemented a number of AK classes to
provide automatic internal locking:
- RefPtr
- NonnullRefPtr
- WeakPtr
- Weakable
This patch renames the Kernel classes so that they can coexist with
the original AK classes:
- RefPtr => LockRefPtr
- NonnullRefPtr => NonnullLockRefPtr
- WeakPtr => LockWeakPtr
- Weakable => LockWeakable
The goal here is to eventually get rid of the Lock* classes in favor of
using external locking.
As soon as we've saved CR2 (the faulting address), we can re-enable
interrupt processing. This should make the kernel more responsive under
heavy fault loads.
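A sketch of the ordering on the fault path; read_cr2() and sti() stand
in for the corresponding instructions:

    // Sketch only: illustrative names.
    void page_fault_entry(RegisterState& regs)
    {
        // CR2 holds the faulting address and is clobbered by the next
        // page fault, so capture it before anything else...
        auto fault_address = read_cr2();
        // ...then interrupts can be re-enabled; the remaining handling
        // may be slow and should not run with interrupts masked.
        sti();
        handle_page_fault(fault_address, regs);
    }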
Region::physical_page() now takes the VMObject lock while accessing the
physical pages array, and returns a RefPtr<PhysicalPage>. This ensures
that the array access is safe.
Region::physical_page_slot() now VERIFY()'s that the VMObject lock is
held by the caller. Since we're returning a reference to the physical
page slot in the VMObject's physical page array, this is the best we
can do here.
We really only need the VMObject lock when accessing the physical pages
array, so once we have a strong pointer to the physical page we want to
remap, we can give up the VMObject lock.
This fixes a deadlock I encountered while building DOOM on SMP.
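A sketch of the narrowed lock scope described above, with illustrative
names:

    // Take the VMObject lock only while touching the physical pages
    // array; return a strong pointer so the page stays alive afterwards.
    RefPtr<PhysicalPage> Region::physical_page(size_t index) const
    {
        SpinlockLocker locker(vmobject().lock());
        return vmobject().physical_pages()[first_page_index() + index];
    }

    // Caller sketch: the lock is released before the remap work, which
    // is what resolves the SMP deadlock mentioned above.
    auto page = region.physical_page(page_index);
    if (page)
        region.remap_vmobject_page(page_index);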
When handling a page fault, we only need to remap the faulting region in
the current process. There's no need to traverse *all* regions that map
the same VMObject and remap them cross-process as well.
Those other regions will get remapped lazily by their own page fault
handlers eventually. Or maybe they won't and we avoided some work. :^)