Background
Now that Docker supports cgroups v2, I would like to take full advantage of it.
When I run a container with a private cgroup namespace using --cgroupns=private, the nested cgroup2 filesystem created by the systemd scope is mounted at the container's /sys/fs/cgroup path correctly. However, Docker mounts it read-only by default:
cgroup2 on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec)
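For anyone who wants to check this from inside a container without reading the mount output by eye, here is a minimal sketch. It assumes the /proc/self/mounts format (source, mount point, type, options, ...) rather than the `mount` output shown above, and the function name cgroup2_options is just my own choice.

```python
def cgroup2_options(mounts_text, mount_point="/sys/fs/cgroup"):
    """Return the set of mount options for the cgroup2 mount at mount_point.

    mounts_text is expected in /proc/self/mounts format:
    <source> <mount point> <fstype> <options> <dump> <pass>
    Returns None if no cgroup2 filesystem is mounted there.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _source, target, fstype, options = fields[:4]
        if target == mount_point and fstype == "cgroup2":
            return set(options.split(","))
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/mounts") as f:
            opts = cgroup2_options(f.read())
    except OSError:
        opts = None
    if opts is None:
        print("no cgroup2 mount found at /sys/fs/cgroup")
    else:
        print("read-only" if "ro" in opts else "read-write")
```

Run inside the container, this should report read-only for the mount shown above.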
Rationale
Technical considerations
I think this is legacy behaviour that was correct for cgroup v1, where mounting the system-global cgroupfs into the container with rw rights would be a gaping security hole.
To my knowledge, a nested cgroup with delegated controllers should be able to write to /sys/fs/cgroup by design, without negative security implications.
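To see what a container actually gets, one can look at the cgroup v2 interface files: cgroup.controllers lists the controllers available to the cgroup, and cgroup.subtree_control lists those enabled for its children. The following is a rough sketch, not a complete delegation check; the function name delegation_report and the choice of files to test for write access are my own assumptions.

```python
import os


def delegation_report(cgroup_root="/sys/fs/cgroup"):
    """Summarise which controllers are visible and whether key
    cgroup v2 interface files are writable by the current process."""

    def read_list(name):
        # Interface files hold a space-separated list of controller names.
        try:
            with open(os.path.join(cgroup_root, name)) as f:
                return f.read().split()
        except OSError:
            return []

    # os.access reports False on a read-only mount or a missing file.
    writable = {
        name: os.access(os.path.join(cgroup_root, name), os.W_OK)
        for name in ("cgroup.procs", "cgroup.subtree_control")
    }
    return {
        "controllers": read_list("cgroup.controllers"),
        "subtree_control": read_list("cgroup.subtree_control"),
        "writable": writable,
    }


if __name__ == "__main__":
    print(delegation_report())
```

With the read-only mount described above, I would expect the controllers to be listed but both files to be reported as not writable.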
Target use-cases
Right now, running containers with a (nested) systemd init or another container runtime requires multiple hacks that seriously weaken security and cause portability problems.
Solving this problem would enable an easier, more secure and possibly even transparent mechanism for:
- allowing containers with a nested systemd init to manage their own resources via slices and scopes, kind of like LXC’s nested mode but without the nasty security implications of bind mounting the real cgroupfs into the container
- allowing nested containerized workloads with the help of fuse-overlayfs
The goal
My goal is to adjust the code so that the cgroup2 filesystem is mounted read-write when a container is run with a private cgroupns and delegated controllers.
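To make the intended change concrete, here is a sketch of it at the level of an OCI runtime spec (the config.json handed to the runtime), written as a plain Python dict transformation. This is not code from Docker, moby, containerd or runc; the function name make_cgroup_mount_rw is hypothetical, and it only checks for a private cgroup namespace, leaving out the delegated-controllers condition.

```python
import copy


def make_cgroup_mount_rw(spec):
    """Return a copy of an OCI runtime spec (config.json as a dict) in which
    the /sys/fs/cgroup mount is read-write, but only if the container gets
    its own cgroup namespace. Otherwise the copy is left unchanged."""
    namespaces = spec.get("linux", {}).get("namespaces", [])
    # A cgroup namespace entry without "path" means a new (private) namespace;
    # with "path" the container joins an existing one.
    private_cgroupns = any(
        ns.get("type") == "cgroup" and not ns.get("path") for ns in namespaces
    )
    result = copy.deepcopy(spec)
    if not private_cgroupns:
        return result
    for mount in result.get("mounts", []):
        if mount.get("destination") == "/sys/fs/cgroup":
            # Drop any existing ro/rw flag, keep the rest, then append rw.
            options = [o for o in mount.get("options", []) if o not in ("ro", "rw")]
            mount["options"] = options + ["rw"]
    return result
```

The point of the namespace check is that a container sharing the host cgroup namespace keeps the current read-only behaviour.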
The problem
The problem is that I don’t really know where to look. Which part of the stack is actually responsible for this? Is it docker, moby, containerd, runc or maybe systemd?
So far I’ve found the default settings in the moby project, but they are for cgroup v1.
Where do I find the code that I need to modify and submit a PR to?
PS: For a more detailed writeup, see my answer on Server Fault and my post on r/docker.