Commit Graph

35 Commits

Author SHA1 Message Date
Liav A.
b93ca74d81 Kernel: Add a prctl option to enter jail mode until an execve syscall
In addition to the already existing option to enter jail mode (which is
set indefinitely), there should be a less restrictive option that should
allow exiting jail mode when doing the execve syscall.

This option will be useful for programs that need this kind of security
layer only in their runtime, but they're meant to actually initiate
another program in the end.
2024-10-03 12:39:45 +02:00
Liav A.
b90a36d2a9 Kernel+Userland: Rename jailed => jailed_until_exit
In all instances, it should be clear that the jailing of a process is
ending when the process exits.

This is a preparation before introducing another option to set a process
as jailed until it calls the execve syscall.
2024-10-03 12:39:45 +02:00
Liav A.
fdf3e0aca1 Kernel: Don't assume sizes of needed buffers early in the execve syscall
Instead, start by trying to read a buffer with size of Elf_Ehdr, and
check it for the shebang sign. If it's indeed an executable with shebang
then read again from the file, now with PAGE_SIZE size, which should
suffice for finding the interpreter path.

However, if the executable is an ELF, we quickly validate it and then
pass the preliminary buffer to the find_elf_interpreter_for_executable
method.

That method calculates the last byte offset which is needed to read all
of the program headers, so we don't just assume 4096 bytes is sufficient
anymore. The same pattern is applied when loading the interpreter ELF
main header and its program headers.
2024-09-01 20:52:55 +02:00
Idan Horowitz
1e2919b5c1 Kernel: Add create_kernel_thread overload that accepts a lambda
Also fixes the entry function argument initialization to be
architecture-independent.
2024-07-26 14:25:49 -04:00
Liav A.
4aec3f4ef9 Kernel+Userland: Simplify loading of an ELF interpreter path
The LibELF validate_program_headers method tried to do too many things
at once, and as a result, we had an awkward return type from it.

To be able to simplify it, we no longer allow passing a StringBuilder*
but instead we require to pass an Optional<Elf_Phdr> by reference so
it could be filled with actual ELF program header that corresponds to
an INTERP header if such found.

As a result, we ensure that only certain implementations that actually
care about the ELF interpreter path will actually try to load it on
their own and if they fail, they can have better diagnostics for an
invalid INTERP header.

This change also fixes a bug that on which we failed to execute an ELF
program if the INTERP header is located outside the first 4KiB page of
the ELF file, as the kernel previously didn't have support for looking
beyond that for that header.
2024-07-21 15:38:52 +02:00
Liav A.
0e6624dc86 Kernel: Introduce the unshare syscall family
These 2 syscalls are responsible for unsharing resources in the system,
such as hostname, VFS root contexts and process lists.

Together with an appropriate userspace implementation, these syscalls
could be used for creating a sandbox environment (containers) for user
programs.
2024-07-21 11:44:23 +02:00
Liav A.
e52abd4c09 Kernel: Introduce the HostnameContext class
Similarly to VFSRootContext and ScopedProcessList, this class intends
to form resource isolation as well.
We add this class as an infrastructure preparation of hostname contexts
which should allow processes to obtain different hostnames on the same
machine.
2024-07-21 11:44:23 +02:00
Liav A.
4370bbb3ad Kernel+Userland: Introduce the copy_mount syscall
This new syscall will be used by the upcoming runc (run-container)
utility.

In addition to that, this syscall allows userspace to neatly copy RAMFS
instances to other places, which was not possible in the past.
2024-07-21 11:44:23 +02:00
Liav A.
dd59fe35c7 Kernel+Userland: Reduce jails to be a simple boolean flag
The whole concept of Jails was far more complicated than I actually want
it to be, so let's reduce the complexity of how it works from now on.
Please note that we always leaked the attach count of a Jail object in
the fork syscall if it failed midway.
Instead, we should have attach to the jail just before registering the
new Process, so we don't need to worry about unsuccessful Process
creation.

The reduction of complexity in regard to jails means that instead of
relying on jails to provide PID isolation, we could simplify the whole
idea of them to be a simple SetOnce, and let the ProcessList (now called
ScopedProcessList) to be responsible for this type of isolation.

Therefore, we apply the following changes to do so:
- We make the Jail concept no longer a class of its own. Instead, we
  simplify the idea of being jailed to a simple ProtectedValues boolean
  flag. This means that we no longer check of matching jail pointers
  anywhere in the Kernel code.
  To set a process as jailed, a new prctl option was added to set a
  Kernel SetOnce boolean flag (so it cannot change ever again).
- We provide Process & Thread methods to iterate over process lists.
  A process can either iterate on the global process list, or if it's
  attached to a scoped process list, then only over that list.
  This essentially replaces the need of checking the Jail pointer of a
  process when iterating over process lists.
2024-07-21 11:44:23 +02:00
Liav A.
91c87c5b77 Kernel+Userland: Prepare for considering VFSRootContext when mounting
Expose some initial interfaces in the mount-related syscalls to select
the desired VFSRootContext, by specifying the VFSRootContext index
number.

For now there's still no way to create a different VFSRootContext, so
the only valid IDs are -1 (for currently attached VFSRootContext) or 1
for the first userspace VFSRootContext.
2024-07-21 11:44:23 +02:00
Liav A.
01e1af732b Kernel/FileSystem: Introduce the VFSRootContext class
The VFSRootContext class, as its name suggests, holds a context for a
root directory with its mount table and the root custody/inode in the
same class.

The idea is derived from the Linux mount namespace mechanism.
It mimicks the concept of the ProcessList object, but it is adjusted for
a root directory tree context.
In contrast to the ProcessList concept, processes that share the default
VFSRootContext can't see other VFSRootContext related properties such as
as the mount table and root custody/inode.

To accommodate to this change progressively, we internally create 2 main
VFS root contexts for now - one for kernel processes (as they don't need
to care about VFS root contexts for the most part), and another for all
userspace programs.
This separation allows us to continue pretending for userspace that
everything is "normal" as it is used to be, until we introduce proper
interfaces in the mount-related syscalls as well as in the SysFS.

We make VFSRootContext objects being listed, as another preparation
before we could expose interfaces to userspace.
As a result, the PowerStateSwitchTask now iterates on all contexts
and tear them down one by one.
2024-07-21 11:44:23 +02:00
Liav A.
ecc9c5409d Kernel: Ignore dirfd if absolute path is given in VFS-related syscalls
To be able to do this, we add a new class called CustodyBase, which can
be resolved on-demand internally in the VirtualFileSystem resolving path
code.

When being resolved, CustodyBase will return a known custody if it was
constructed with such, if that's not the case it will provide the root
custody if the original path is absolute.
Lastly, if that's not the case as well, it will resolve the given dirfd
to provide a Custody object.
2024-06-01 19:25:15 +02:00
Liav A.
15ddc1f17a Kernel+Userland: Reject W->X prot region transition after a prctl call
We add a prctl option which would be called once after the dynamic
loader has finished to do text relocations before calling the actual
program entry point.

This change makes it much more obvious when we are allowed to change
a region protection access from being writable to executable.
The dynamic loader should be able to do this, but after a certain point
it is obvious that such mechanism should be disabled.
2024-05-14 12:41:51 -06:00
Sönke Holz
243d7003a2 Kernel+LibC+LibELF: Move TLS handling to userspace
This removes the allocate_tls syscall and adds an archctl option to set
the fs_base for the current thread on x86-64, since you can't set that
register from userspace. enter_thread_context loads the fs_base for the
next thread on each context switch.
This also moves tpidr_el0 (the thread pointer register on AArch64) to
the register state, so it gets properly saved/restored on context
switches.

The userspace TLS allocation code is kept pretty similar to the original
kernel TLS code, aside from a couple of style changes.

We also have to add a new argument "tls_pointer" to
SC_create_thread_params, as we otherwise can't prevent race conditions
between setting the thread pointer register and signal handling code
that might be triggered before the thread pointer was set, which could
use TLS.
2024-04-19 16:46:47 -06:00
Sönke Holz
57f4f8caf8 Kernel+LibC: Introduce new archctl syscall
This syscall will be used for architecture-specific operations.
2024-04-19 16:46:47 -06:00
Space Meyer
a721e4d507 Kernel: Track KCOVInstance via Process instead of HashMap
While this clutters Process.cpp a tiny bit, I feel that it's worth it:
- 2x speed on the kcov_loop benchmark. Likely more during fuzzing.
- Overall code complexity is going down with this change.
- By reducing the code reachable from __sanitizer_cov_trace_pc code,
  we can now instrument more code.
2024-04-15 21:16:22 -06:00
Idan Horowitz
6a4b93b3e0 Kernel: Protect processes' master TLS with a fine-grained spinlock
This moves it out of the scope of the big process lock, and allows us
to wean some syscalls off it, starting with sys$allocate_tls.
2023-12-26 19:20:21 +01:00
Idan Horowitz
a49b7e92eb Kernel: Shrink instead of expand sigaltstack range to page boundaries
Since the POSIX sigaltstack manpage suggests allocating the stack
region using malloc(), and many heap implementations (including ours)
store heap chunk metadata in memory just before the vended pointer,
we would end up zeroing the metadata, leading to various crashes.
2023-12-24 16:11:35 +01:00
Daniel Bertalan
45d81dceed Everywhere: Replace ElfW(type) macro usage with Elf_type
This works around a `clang-format-17` bug which caused certain usages to
be misformatted and fail to compile.

Fixes #8315
2023-12-01 10:02:39 +02:00
Tim Schumacher
a2f60911fe AK: Rename GenericTraits to DefaultTraits
This feels like a more fitting name for something that provides the
default values for Traits.
2023-11-09 10:05:51 -05:00
Jakub Berkop
54e79aa1d9 Kernel+ProfileViewer: Display additional filesystem events 2023-09-09 11:26:51 -06:00
Liav A
1c0aa51684 Kernel+Userland: Remove the {get,set}_thread_name syscalls
These syscalls are not necessary on their own, and they give the false
impression that a caller could set or get the thread name of any process
in the system, which is not true.

Therefore, move the functionality of these syscalls to be options in the
prctl syscall, which makes it abundantly clear that these operations
could only occur from a running thread in a process that sees other
threads in that process only.
2023-08-25 11:51:52 +02:00
Daniel Bertalan
286984750e Kernel+LibC: Pass 64-bit integers in syscalls by value
Now that support for 32-bit x86 has been removed, we don't have to worry
about the top half of `off_t`/`u64` values being chopped off when we try
to pass them in registers. Therefore, we no longer need the workaround
of pointers to stack-allocated values to syscalls.

Note that this changes the system call ABI, so statically linked
programs will have to be re-linked.
2023-08-12 01:14:26 +02:00
Liav A
d8b514873f Kernel: Use FixedStringBuffer for fixed-length strings in syscalls
Using the kernel stack is preferable, especially when the examined
strings should be limited to a reasonable length.

This is a small improvement, because if we don't actually move these
strings then we don't need to own heap allocations for them during the
syscall handler function scope.

In addition to that, some kernel strings are known to be limited, like
the hostname string, for these strings we also can use FixedStringBuffer
to store and copy to and from these buffers, without using any heap
allocations at all.
2023-08-09 21:06:54 -06:00
Liav A
3fd4997fc2 Kernel: Don't allocate memory for names of processes and threads
Instead, use the FixedCharBuffer class to ensure we always use a static
buffer storage for these names. This ensures that if a Process or a
Thread were created, there's a guarantee that setting a new name will
never fail, as only copying of strings should be done to that static
storage.

The limits which are set are 32 characters for processes' names and 64
characters for thread names - this is because threads' names could be
more verbose than processes' names.
2023-08-09 21:06:54 -06:00
Tim Schumacher
9d6372ff07 Kernel: Consolidate finding the ELF stack size with validation
Previously, we started parsing the ELF file again in a completely
different place, and without the partial mapping that we do while
validating.

Instead of doing manual parsing in two places, just capture the
requested stack size right after we validated it.
2023-07-10 21:08:31 -06:00
Timothy Flynn
c911781c21 Everywhere: Remove needless trailing semi-colons after functions
This is a new option in clang-format-16.
2023-07-08 10:32:56 +01:00
Liav A
23a7ccf607 Kernel+LibCore+LibC: Split the mount syscall into multiple syscalls
This is a preparation before we can create a usable mechanism to use
filesystem-specific mount flags.
To keep some compatibility with userland code, LibC and LibCore mount
functions are kept being usable, but now instead of doing an "atomic"
syscall, they do multiple syscalls to perform the complete procedure of
mounting a filesystem.

The FileBackedFileSystem IntrusiveList in the VFS code is now changed to
be protected by a Mutex, because when we mount a new filesystem, we need
to check if a filesystem is already created for a given source_fd so we
do a scan for that OpenFileDescription in that list. If we fail to find
an already-created filesystem we create a new one and register it in the
list if we successfully mounted it. We use a Mutex because we might need
to initiate disk access during the filesystem creation, which will take
other mutexes in other parts of the kernel, therefore making it not
possible to take a spinlock while doing this.
2023-07-02 01:04:51 +02:00
implicitfield
79adeb626b LibC+LibELF: Move ELF definitions from LibC to LibELF
This is needed to avoid including LibC headers in Lagom builds.
Unfortunately, we cannot rely on the build machine to provide a
fully POSIX-compatible ELF header for Lagom builds, so we have to
use our own.
2023-06-27 12:40:38 +02:00
Jelle Raaijmakers
81a6976e90 Kernel: De-atomicize fields for promises in Process
These 4 fields were made `Atomic` in
c3f668a758, at which time these were still
accessed unserialized and TOCTOU bugs could happen. Later, in
8ed06ad814, we serialized access to these
fields in a number of helper methods, removing the need for `Atomic`.
2023-06-09 17:15:54 +02:00
Tim Ledbetter
f25530a12d Kernel: Store creation time when creating a process 2023-06-09 17:15:41 +02:00
Liav A
927926b924 Kernel: Move Performance-measurement code to the Tasks subdirectory 2023-06-04 21:32:34 +02:00
Liav A
7c0540a229 Everywhere: Move global Kernel pattern code to Kernel/Library directory
This has KString, KBuffer, DoubleBuffer, KBufferBuilder, IOWindow,
UserOrKernelBuffer and ScopedCritical classes being moved to the
Kernel/Library subdirectory.

Also, move the panic and assertions handling code to that directory.
2023-06-04 21:32:34 +02:00
Liav A
ee0ccdaebe Kernel: Move Credentials.{cpp,h} to the Security subdirectory 2023-06-04 21:32:34 +02:00
Liav A
1b04726c85 Kernel: Move all tasks-related code to the Tasks subdirectory 2023-06-04 21:32:34 +02:00