Commit Graph

2 Commits

Author SHA1 Message Date
Liav A.
f6db24dba4 Kernel+runc: Remove the pivot_root functionality in copy_mount syscall
That functionality seems to be too complicated.
We shouldn't overengineer how the copy_mount syscall works, so instead
of allowing replacement of the root filesystem, let's make the unshare
file descriptor configurable via a special ioctl call before we
initialize a new VFSRootContext object.

The special ioctl can either set a new root filesystem for the upcoming
VFSRootContext object or clear it (by passing an fd of -1).
If no root filesystem is specified, a new RAMFS instance will be
created automatically when invoking the unshare_create syscall.
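A minimal Python model of this configure-then-create flow, written only to illustrate it. The names `UnshareVFSDescriptor`, `ioctl_set_root_filesystem`, `unshare_create`, `FileSystem` and `CLEAR_ROOT_FS` are hypothetical stand-ins, not the kernel's actual API, and a plain dict plays the role of the process's file descriptor table.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CLEAR_ROOT_FS = -1  # hypothetical name: passing fd -1 removes the configured root


@dataclass(eq=False)
class FileSystem:
    kind: str  # illustrative label such as "RAMFS" or "Ext2FS"


@dataclass
class VFSRootContext:
    root_filesystem: FileSystem


class UnshareVFSDescriptor:
    """Hypothetical model of an unshare fd that stages a root filesystem."""

    def __init__(self, fd_table: Dict[int, FileSystem]) -> None:
        self._fd_table = fd_table  # stands in for the process's open fds
        self._root_fs: Optional[FileSystem] = None

    @property
    def configured_root(self) -> Optional[FileSystem]:
        return self._root_fs

    def ioctl_set_root_filesystem(self, fd: int) -> None:
        if fd == CLEAR_ROOT_FS:
            self._root_fs = None
        else:
            # An unknown fd raises KeyError here, modelling an invalid descriptor.
            self._root_fs = self._fd_table[fd]


def unshare_create(descriptor: UnshareVFSDescriptor) -> VFSRootContext:
    # With no configured root, fall back to a fresh RAMFS instance.
    root = descriptor.configured_root
    if root is None:
        root = FileSystem("RAMFS")
    return VFSRootContext(root_filesystem=root)


fds = {3: FileSystem("Ext2FS")}
desc = UnshareVFSDescriptor(fds)
desc.ioctl_set_root_filesystem(3)
print(unshare_create(desc).root_filesystem.kind)  # Ext2FS
desc.ioctl_set_root_filesystem(CLEAR_ROOT_FS)
print(unshare_create(desc).root_filesystem.kind)  # RAMFS
```

Keeping the root filesystem on the descriptor means the decision is made once, before the context exists, instead of swapping roots inside an already-populated context.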

This also simplifies the code in the boot process, making it much
more readable.

It should be noted that pivot_root assumed that the first mountpoint
in a context is the root mountpoint. This is probably a fair assumption,
but we don't make it anywhere else in the VFSRootContext code.
If this functionality ever comes back, we should make some effort not
to rely on this assumption again.
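To make the assumption concrete, here is a small hypothetical Python sketch contrasting a position-based lookup (trusting that the first mount is the root) with one that searches for the mount at "/". `Mount`, `root_mount_by_position` and `find_root_mount` are illustrative names, and matching on the path string is an assumption of this model, not necessarily how VFSRootContext identifies its root.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mount:
    path: str     # where the filesystem is mounted inside the context
    fs_kind: str  # illustrative label such as "RAMFS"


def root_mount_by_position(mounts: List[Mount]) -> Mount:
    # The fragile shortcut: trust that the root mount was added first.
    return mounts[0]


def find_root_mount(mounts: List[Mount]) -> Mount:
    # Position-independent lookup: identify the root mount by its path.
    # Toy-model assumption: at most one mount sits at "/".
    for mount in mounts:
        if mount.path == "/":
            return mount
    raise LookupError("context has no root mount")


mounts = [Mount("/tmp", "RAMFS"), Mount("/", "Ext2FS")]
print(root_mount_by_position(mounts).path)  # /tmp (not the root)
print(find_root_mount(mounts).path)         # /
```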
2026-03-14 11:45:37 +01:00
Liav A.
2a4a096e0f Kernel+runc: Make unshare syscalls more fd-oriented
Instead of creating a new resource that has its own ID number and working
with it directly, we can create a file that describes the unshared
resource, perform ioctl calls on it, and only enter it at the end.
This effectively creates the resource only during that last call,
instead of the previous method, where a resource was created and then
"attached" to.
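A hypothetical Python model of this deferred creation: settings are staged on the descriptor through `ioctl`, and the resource (with its ID) only comes into existence on `enter()`. `UnshareDescriptor`, `Resource` and the `"hostname"` setting are illustrative assumptions; the commit does not spell out which settings each resource type accepts.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Resource:
    resource_id: int
    settings: Tuple[Tuple[str, object], ...]  # configuration snapshot at creation


class UnshareDescriptor:
    """Hypothetical model: settings are staged via ioctl, the resource appears on enter()."""

    _ids = itertools.count(1)  # shared ID allocator for every created resource

    def __init__(self) -> None:
        self._settings: Dict[str, object] = {}
        self.resource: Optional[Resource] = None

    def ioctl(self, key: str, value: object) -> None:
        # Configuration only touches the descriptor; nothing is allocated yet.
        self._settings[key] = value

    def enter(self) -> Resource:
        # The resource and its ID come into existence only on this final call.
        if self.resource is None:
            snapshot = tuple(sorted(self._settings.items()))
            self.resource = Resource(next(UnshareDescriptor._ids), snapshot)
        return self.resource


desc = UnshareDescriptor()
desc.ioctl("hostname", "container0")
print(desc.resource)             # None
print(desc.enter().resource_id)  # 1
```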

We can enter a resource for the current program execution, after the exec
syscall, or both.
This change allows userspace to create a resource and attach to it only
in the new program, which makes it easier to do cleanups or to track
the new process from outside the created container.
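The three entry modes can be sketched as a bit flag, under the assumption that "after exec" means the switch is recorded now and applied when the process executes a new program. `EnterMode`, `Process` and the string resource names are hypothetical; the real flag names are not given in the commit.

```python
from dataclasses import dataclass
from enum import Flag
from typing import Optional


class EnterMode(Flag):
    CURRENT = 1     # switch the running program immediately
    AFTER_EXEC = 2  # switch only when the process calls exec
    BOTH = 3        # CURRENT | AFTER_EXEC


@dataclass
class Process:
    active: str = "host"                   # resource the running image uses
    pending_on_exec: Optional[str] = None  # resource to apply at exec time

    def enter(self, resource: str, mode: EnterMode) -> None:
        if EnterMode.CURRENT in mode:
            self.active = resource
        if EnterMode.AFTER_EXEC in mode:
            self.pending_on_exec = resource

    def exec(self) -> None:
        # The new program image starts inside the pending resource, if any.
        if self.pending_on_exec is not None:
            self.active = self.pending_on_exec
            self.pending_on_exec = None


child = Process()
child.enter("container", EnterMode.AFTER_EXEC)
print(child.active)  # host
child.exec()
print(child.active)  # container
```

With `AFTER_EXEC`, the launcher code keeps running in the host context until the exec, which is what lets it set up tracking or cleanup from outside the container first.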

It should be noted that until this commit, we entered a resource without
detaching the old one, essentially leaking the old resource's attach
counter. While this bug didn't have severe effects, proper userspace
cleanup code written later on wouldn't have worked in that situation
anyway, so this commit changes how this works: entering a resource now
actually means **replacing** the current one.
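A toy Python model of the attach counter, contrasting the old leaking behaviour with replace semantics. `Resource`, `Process`, `enter_without_detach` and `attach_count` are hypothetical names; the kernel's real bookkeeping is not shown in the commit, and the idea that a zero count lets the resource be cleaned up is an assumption of this model.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Resource:
    name: str
    attach_count: int = 0  # number of processes currently attached


class Process:
    def __init__(self, initial: Resource) -> None:
        initial.attach_count += 1
        self.current = initial

    def enter_without_detach(self, new: Resource) -> None:
        # Old behaviour: attach to the new resource but never detach the old one.
        new.attach_count += 1
        self.current = new

    def enter(self, new: Resource) -> None:
        # Replace semantics: attach to the new resource, then detach the old one.
        # Attaching first means re-entering the current resource never drops
        # its count to zero, even briefly.
        new.attach_count += 1
        self.current.attach_count -= 1
        self.current = new


host, box = Resource("host"), Resource("box")
leaky = Process(host)
leaky.enter_without_detach(box)
print(host.attach_count)  # 1 (leaked)

host2, box2 = Resource("host"), Resource("box")
proper = Process(host2)
proper.enter(box2)
print(host2.attach_count, box2.attach_count)  # 0 1
```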

These changes open an opportunity to extend runc into a container
manager rather than a launcher of a containerized environment, which
makes it possible to do all sorts of nice cleanups and tracking of
container states.
2026-03-14 11:45:37 +01:00