Podman: Problems with user namespaces

For several years, I’ve advocated user namespaces as the security tool everyone wants but hardly anyone uses. The reason is that filesystems have not supported them: there has been no shifting filesystem that remaps file ownership to fit a namespace.
With containers, you want to share one base image among many containers. Each of the examples above uses the Fedora base image. Most of the files in the Fedora image are owned by real UID=0. If I run a container on this image with the user namespace mapping 0:100000:5000, by default it sees all of these files as owned by nobody, because real UID 0 is not mapped into the namespace. So we need to shift all of these UIDs to match the user namespace. For years, I’ve wanted a mount option that tells the kernel to remap these file UIDs to match the user namespace. Upstream kernel storage developers continue to investigate and make progress on this feature, but it is a difficult problem.
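
To make the 0:100000:5000 notation concrete, here is a minimal Python sketch of the arithmetic behind a single mapping. The names `IDMap`, `to_host`, and `seen_in_container` are made up for illustration and are not part of Podman or the kernel; the value 65534 is the kernel's default overflow UID, which is what usually shows up as nobody.

```python
from typing import NamedTuple, Optional

OVERFLOW_UID = 65534  # kernel default overflow UID, usually named "nobody"


class IDMap(NamedTuple):
    container_start: int  # first UID inside the namespace
    host_start: int       # first real UID on the host
    length: int           # number of consecutive UIDs mapped


def to_host(m: IDMap, container_uid: int) -> Optional[int]:
    """Return the real UID behind a container UID, or None if unmapped."""
    offset = container_uid - m.container_start
    if 0 <= offset < m.length:
        return m.host_start + offset
    return None


def seen_in_container(m: IDMap, host_uid: int) -> int:
    """Return the owner a process in the namespace sees for a file owned by host_uid."""
    offset = host_uid - m.host_start
    if 0 <= offset < m.length:
        return m.container_start + offset
    return OVERFLOW_UID  # unmapped owners show up as nobody


mapping = IDMap(container_start=0, host_start=100000, length=5000)
print(to_host(mapping, 0))                  # 100000
print(seen_in_container(mapping, 0))        # 65534
print(seen_in_container(mapping, 100000))   # 0
```

The second line of output is the problem described above: a file owned by real UID 0 falls outside the mapped range, so the container sees it as owned by nobody.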

Podman can use different user namespaces on the same image because of automatic chowning built into containers/storage by a team led by Nalin Dahyabhai. Podman uses containers/storage, and the first time Podman uses a container image in a new user namespace, containers/storage “chowns” (i.e., changes ownership for) all files in the image to the UIDs mapped in the user namespace and creates a new image. Think of this as the fedora:0:100000:5000 image.

When Podman runs another container on the image with the same UID mappings, it uses the “pre-chowned” image. When I run the second container on 0:200000:5000, containers/storage creates a second image, let’s call it fedora:0:200000:5000.
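
The chown-once-then-reuse behavior can be modeled in a few lines of Python. This is only a toy model under stated assumptions: `ImageStore` and `shift_owners` are hypothetical names, an image is reduced to a dictionary of path to owning UID, and nothing here reflects how containers/storage is actually implemented.

```python
from typing import Dict, Tuple

Mapping = Tuple[int, int, int]   # (container_start, host_start, length)
Owners = Dict[str, int]          # path -> owning UID


def shift_owners(owners: Owners, m: Mapping) -> Owners:
    """Return a copy of the image with every mappable owner moved into the host range."""
    container_start, host_start, length = m
    shifted = {}
    for path, uid in owners.items():
        offset = uid - container_start
        # UIDs outside the mapped range are left untouched in this toy model.
        shifted[path] = host_start + offset if 0 <= offset < length else uid
    return shifted


class ImageStore:
    """Hypothetical stand-in for the store: one chowned copy per (image, mapping)."""

    def __init__(self, base_images: Dict[str, Owners]) -> None:
        self.base_images = base_images
        self.chowned: Dict[Tuple[str, Mapping], Owners] = {}
        self.chown_runs = 0  # counts how often the slow path was taken

    def image_for(self, name: str, m: Mapping) -> Owners:
        key = (name, m)
        if key not in self.chowned:
            self.chowned[key] = shift_owners(self.base_images[name], m)
            self.chown_runs += 1
        return self.chowned[key]


store = ImageStore({"fedora": {"/usr/bin/bash": 0, "/etc/passwd": 0}})
first = store.image_for("fedora", (0, 100000, 5000))  # chowns: "fedora:0:100000:5000"
again = store.image_for("fedora", (0, 100000, 5000))  # reuses the pre-chowned copy
other = store.image_for("fedora", (0, 200000, 5000))  # chowns: "fedora:0:200000:5000"
print(first["/usr/bin/bash"], other["/usr/bin/bash"], store.chown_runs)  # 100000 200000 2
```

Three containers were requested but the slow path ran only twice, once per distinct mapping.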

Note that if you run podman build or podman commit and push the newly created image to a container registry, Podman uses containers/storage to reverse the shift and pushes the image with all files chowned back to real UID=0.
The chowning can cause a real slowdown when creating containers in new UID mappings, since the chown can be slow depending on the number of files in the image. Also, on a normal OverlayFS, every file in the image gets copied up, because changing a file’s owner counts as modifying it. The normal Fedora image can take up to 30 seconds to finish the chown and start the container.
Luckily, the Red Hat kernel storage team, primarily Vivek Goyal and Miklos Szeredi, added a new feature to OverlayFS in kernel 4.19. The feature is called metadata-only copy-up. If you mount an overlay filesystem with metacopy=on as a mount option, it will not copy up the contents of the lower layers when you change file attributes; instead, the kernel creates new inodes that hold the attributes and reference the lower-level data. It will still copy up the contents if the content changes. This functionality is available in the Red Hat Enterprise Linux 8 Beta, if you want to try it out.
This means container chowning can happen in a couple of seconds, and you won’t double the storage space for each container.
This makes running containers with tools like Podman in separate user namespaces viable, greatly increasing the security of the system.
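
For readers who want to see where metacopy=on goes, here is a small Python sketch that assembles the option string for an overlay mount. The helper names and the directory paths are hypothetical; lowerdir, upperdir, workdir, and metacopy are the standard overlay mount options. The sketch only builds the command and does not run it, since the real mount needs root and a kernel with the feature.

```python
from typing import List


def overlay_options(lowerdirs: List[str], upperdir: str, workdir: str,
                    metacopy: bool = True) -> str:
    """Build the -o option string for an overlay mount."""
    opts = [
        "lowerdir=" + ":".join(lowerdirs),  # read-only image layers, topmost first
        "upperdir=" + upperdir,             # writable layer that receives copy-ups
        "workdir=" + workdir,               # scratch directory overlay requires
    ]
    if metacopy:
        opts.append("metacopy=on")          # copy up metadata only on chown/chmod
    return ",".join(opts)


def mount_command(target: str, options: str) -> List[str]:
    """Return the argv for mount(8); running it needs root and a 4.19+ kernel."""
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]


# Hypothetical paths, for illustration only.
opts = overlay_options(["/layers/fedora"], "/ctr/upper", "/ctr/work")
print(" ".join(mount_command("/ctr/merged", opts)))
```

In practice you would normally let the container tooling set this option through its storage configuration rather than mounting by hand.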

Going forward

I want to add a new flag, like --userns=auto, to Podman that will tell it to automatically pick a unique user namespace for each container you run. This is similar to the way SELinux works with separate multi-category security (MCS) labels. If you set the environment variable PODMAN_USERNS=auto, you won’t even need to set the flag.
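
One way to picture "a unique user namespace for each container" is an allocator that hands out non-overlapping blocks of host UIDs. The Python sketch below is purely illustrative: `AutoUserNS` is a hypothetical name, the starting UID and block size are borrowed from the 0:100000:5000 example, and it makes no claim about how such a flag would be implemented in Podman.

```python
from typing import Dict, Tuple


class AutoUserNS:
    """Hypothetical allocator: hands each container its own block of host UIDs."""

    def __init__(self, first_host_uid: int = 100000, size: int = 5000) -> None:
        self.next_start = first_host_uid
        self.size = size
        self.assigned: Dict[str, Tuple[int, int, int]] = {}

    def mapping_for(self, container: str) -> Tuple[int, int, int]:
        """Return (container_start, host_start, length), allocating on first use."""
        if container not in self.assigned:
            self.assigned[container] = (0, self.next_start, self.size)
            self.next_start += self.size  # the next container gets a disjoint range
        return self.assigned[container]


alloc = AutoUserNS()
print(alloc.mapping_for("web"))  # (0, 100000, 5000)
print(alloc.mapping_for("db"))   # (0, 105000, 5000)
print(alloc.mapping_for("web"))  # (0, 100000, 5000)
```

Because the ranges never overlap, a process that escapes one container holds UIDs that own nothing in any other container.
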
Podman is finally allowing users to run containers in separate user namespaces. Tools like Buildah and CRI-O will also be able to take advantage of user namespaces. For CRI-O, however, Kubernetes needs to understand which user namespace will run the container engine, and upstream is working on that.
In my next article, I will explain how to run Podman as non-root in a user namespace.