Improving Kubernetes and container security with user namespaces

Author: Alban Crequy

In this blog post, I will introduce user namespaces, explain why they are useful for containers and how they interact with Linux capabilities and filesystems. Then, I will explain the work we’ve done on two user namespace projects with Netflix: adding unprivileged user namespace support to FUSE and current work we’re doing to enable user namespaces in Kubernetes.

What are user namespaces?

Containers on Linux are not a first-class concept; instead, they are built upon namespaces, cgroups and other Linux primitives. The Linux API offers different kinds of namespaces, each isolating a specific aspect of the operating system. For example, two containers in different network namespaces will not see each other’s network interfaces. Two containers in different PID namespaces will not see each other’s processes.

User namespaces are similar: they isolate user IDs and group IDs from each other. On Linux, all files and all processes are owned by a specific user id and group id, usually defined in /etc/passwd and /etc/group. A user namespace can be configured to let a container only see a subset of the host’s user IDs and group IDs.

In the example below, the two containers are configured to use distinct sets of user IDs, offering more isolation between themselves and between the containers and the host. Processes and files in container 1 of the example might have the illusion of being root (user id 0), but are in fact using user id 100000 on the host.

2 containers as 2 different UIDs
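To make the mapping concrete, here is a minimal sketch of how an ID inside a user namespace translates to an ID on the host. The three fields mirror the columns of /proc/<pid>/uid_map (ID inside the namespace, ID outside, length of the range). The names IdMapping and to_host_id are mine, and the range for container 2 is an assumed value chosen only for illustration.

```python
from typing import List, NamedTuple, Optional


class IdMapping(NamedTuple):
    inside_start: int   # first ID as seen inside the user namespace
    outside_start: int  # first ID as seen on the host
    length: int         # number of consecutive IDs covered by this entry


def to_host_id(mappings: List[IdMapping], inside_id: int) -> Optional[int]:
    """Translate an ID seen in the container to the host ID, or None if unmapped."""
    for m in mappings:
        if m.inside_start <= inside_id < m.inside_start + m.length:
            return m.outside_start + (inside_id - m.inside_start)
    return None


# Container 1 uses host IDs starting at 100000, as in the example above.
container1 = [IdMapping(0, 100000, 65536)]
# Container 2 uses a distinct range (assumed value for illustration).
container2 = [IdMapping(0, 200000, 65536)]

print(to_host_id(container1, 0))      # root in container 1
print(to_host_id(container2, 0))      # root in container 2
print(to_host_id(container1, 70000))  # outside the mapped range
```

Root in each container ends up as a different, unprivileged user on the host, and an ID outside the configured range has no host counterpart at all.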

User namespaces and capabilities

This ability to appear as user “root” in the container while in fact being another user on the host is an important feature of user namespaces. To go into more detail, let me first sum up the history of this feature in the Linux kernel:

Before Linux 2.2, there was one user, “root” with user id 0, who could do privileged operations like configuring the network and mounting new filesystems. Regular users couldn’t do those privileged operations.
Since Linux 2.2, the privileges from the “root” user are split into different capabilities, e.g. CAP_NET_ADMIN to configure the network and CAP_SYS_ADMIN to mount a new filesystem.
Linux 3.8 introduced user namespaces in 2013 and, with them, capabilities are no longer global but interpreted in the context of a user namespace.

As an example of what that means, we’ll consider the scenario where a container executes the program “sshfs” to mount an SSH filesystem using FUSE. Mounting a filesystem requires the capability CAP_SYS_ADMIN. This is not something normally granted to containers, as it would give them too much power and defeat the purpose of isolation between the host and the container.

When a container is set up without a new user namespace, the only way to allow it to execute “sshfs” successfully is to give it CAP_SYS_ADMIN, which is unfortunate given the security implications discussed above, even though “sshfs” has no need to affect the host.

PID, Mount Namespace, and Initial User Namespace

But when the container is using a new user namespace, it can be given CAP_SYS_ADMIN in that user namespace without having CAP_SYS_ADMIN on the host user namespace. Being “root” (user id 0) in the container does not impact the host because the real user id is not root on the host.

More precisely, to mount a filesystem in the mount namespace of the container requires the process to have CAP_SYS_ADMIN in the user namespace owning that mount namespace.

PID, Mount Namespace, and Unshared User Namespace, from the Initial User Namespace
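The rule that a capability is only effective relative to a user namespace can be sketched as a small model. This is a simplified illustration of the kernel's check, not its implementation: the class and function names are hypothetical, and I assume the container's user namespace was created by host root (uid 0), as a container runtime running as root would do.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(eq=False)
class UserNamespace:
    name: str
    parent: Optional["UserNamespace"] = None
    owner_uid: int = 0  # host uid of the process that created the namespace


@dataclass
class Process:
    userns: UserNamespace  # user namespace the process lives in
    euid: int              # effective uid, expressed as a host uid
    caps: Set[str]         # capabilities held in its own user namespace


def has_capability(proc: Process, target_ns: UserNamespace, cap: str) -> bool:
    """Does proc hold cap with respect to resources owned by target_ns?"""
    ns = target_ns
    while ns is not None:
        if ns is proc.userns:
            # Same namespace: the process's own capability set decides.
            return cap in proc.caps
        if ns.parent is proc.userns and ns.owner_uid == proc.euid:
            # The creator of a child namespace holds all capabilities in it.
            return True
        ns = ns.parent
    # target_ns is neither proc's namespace nor one of its descendants.
    return False


host_ns = UserNamespace("host")
container_ns = UserNamespace("container", parent=host_ns, owner_uid=0)

# "root" in the container: host uid 100000, CAP_SYS_ADMIN in container_ns only.
sshfs = Process(userns=container_ns, euid=100000, caps={"CAP_SYS_ADMIN"})

print(has_capability(sshfs, container_ns, "CAP_SYS_ADMIN"))  # container mount namespace
print(has_capability(sshfs, host_ns, "CAP_SYS_ADMIN"))       # host mount namespace
```

In this model, a mount namespace owned by container_ns accepts the mount request from sshfs, while anything owned by the host user namespace does not.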

User namespaces and filesystems

From the explanation above, it looks like thanks to user namespaces, it is possible to allow containers to mount filesystems without being root on the host. But depending on the filesystem, there are more pieces that come into play. Linux keeps a list of filesystems deemed safe for mounting in user namespaces, marked with the FS_USERNS_MOUNT flag. Creating a new mount in a non-initial user namespace is only allowed with those filesystem types. You can find the list in the Linux git repository with:

git grep -nw FS_USERNS_MOUNT

As you can see in the table below, new filesystem mounts in non-initial user namespaces were initially restricted to procfs and sysfs, and more filesystems were allowed over the years. This is because it takes time to ensure that a filesystem is safe to use by unprivileged users.

filesystem	Allowed in user namespaces? (`FS_USERNS_MOUNT` flag)
Procfs, sysfs	Yes, since Linux 3.8, 2012 (4f326c0064b20)
tmpfs	Yes, since Linux 3.9, 2013 (b3c6761d9b5cc)
cgroupfs	Yes, since Linux 4.6, 2016 (1c53753e0df1a)
FUSE filesystem	Yes, since Linux 4.18, 2018 (4ad769f3c346), or earlier on Ubuntu kernels
overlay filesystem	Yes, since Linux 5.11, 2020, or earlier on Ubuntu kernels (patch)
NFS, ext4, btrfs, etc.	No
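The table can also be read as a simple lookup. The sketch below encodes only the upstream versions listed above; the function name and the filesystem type strings used as keys are my own choices, and distribution kernels such as Ubuntu's may differ from what this returns.

```python
from typing import Dict, Tuple

# First upstream kernel version carrying FS_USERNS_MOUNT, per the table above.
USERNS_MOUNT_SINCE: Dict[str, Tuple[int, int]] = {
    "proc": (3, 8),
    "sysfs": (3, 8),
    "tmpfs": (3, 9),
    "cgroup": (4, 6),
    "fuse": (4, 18),
    "overlay": (5, 11),
}


def mountable_in_userns(fstype: str, kernel: Tuple[int, int]) -> bool:
    """True if an upstream kernel of this version allows new mounts of
    fstype from a non-initial user namespace, according to the table."""
    since = USERNS_MOUNT_SINCE.get(fstype)
    return since is not None and kernel >= since


print(mountable_in_userns("fuse", (4, 17)))
print(mountable_in_userns("fuse", (4, 18)))
print(mountable_in_userns("ext4", (5, 11)))
```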

Impact on container security

User namespaces are another layer of security that isolates the host from the container. A series of container vulnerabilities were mitigated by user namespaces, and it is reasonable to expect that some future vulnerabilities will be mitigated as well.

As an example, there was CVE-2019-5736 (fix in runc) where the host runc binary could be overwritten from the container. The announcement of the vulnerability mentions that using user namespaces is one of the possible mitigations.

This is because even though the vulnerability allows a process in the container to have a reference to runc on the host via /proc/self/exe, the runc binary would be owned by a user (root) that is not mapped in the container. The container would perceive runc as belonging to the special user “nobody” and the “root” user in the container would not have write rights on it.

We had a blog post explaining the details of the issue and how Flatcar Container Linux mitigated it in a different way, by using a read-only filesystem for hosting the runc binary. The diagram below shows the sequence of steps used by runc to create a container, and how the /proc/self/exe can be a reference to runc. For more details about how that works, see our blog post.

runc does a fork and unshares namespaces

The Kubernetes User Namespaces KEP (KEP/127) lists a couple of other vulnerabilities that can be mitigated with user namespaces.

Enabling user namespaces for FUSE filesystems

Before Linux 4.18, FUSE was not allowed in user namespaces but the effort to make it work was mostly complete: Ubuntu kernels already had a patch set to support it.

The missing part for Linux upstream was proper integration with the integrity subsystem, IMA (Integrity Measurement Architecture). On Linux systems that use IMA, the kernel can detect if files are modified before allowing processes to read or execute them. It enables an audit trail of what is executed. It can also enable the Linux Extended Verification Module (EVM) to enforce a policy to only execute programs if their content is known to be good or signed by a trusted key.

As opposed to files stored on a local hard disk with a traditional filesystem, files on filesystems such as FUSE can be modified without the kernel being able to re-measure, re-appraise, and re-audit files before being served.

We wrote a patch on the memfs FUSE driver to demonstrate the following problem in this scenario:

On the first request, the FUSE driver serves a file with its initial content.
IMA provides a measurement of the file.
On the second request, the FUSE driver serves the same file with altered content.
IMA does not measure the content again so the altered content is not measured.
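The four steps above can be reproduced with a toy model. This is not how IMA or FUSE are implemented; ToyFuseFile and ToyMeasurer are hypothetical stand-ins that only capture the caching behaviour described here, plus a force switch that re-measures on every access.

```python
import hashlib
from typing import Dict, List, Tuple


class ToyFuseFile:
    """Stand-in for a FUSE-served file whose content changes between reads."""

    def __init__(self, versions: List[bytes]) -> None:
        self._versions = list(versions)
        self._reads = 0

    def read(self) -> bytes:
        index = min(self._reads, len(self._versions) - 1)
        self._reads += 1
        return self._versions[index]


class ToyMeasurer:
    """Stand-in for a measurement layer that caches one digest per path."""

    def __init__(self, force: bool = False) -> None:
        self.force = force
        self._cache: Dict[str, str] = {}
        self.log: List[str] = []  # digests recorded in the measurement list

    def open(self, path: str, f: ToyFuseFile) -> bytes:
        data = f.read()
        if self.force or path not in self._cache:
            digest = hashlib.sha256(data).hexdigest()
            self._cache[path] = digest
            self.log.append(digest)
        return data


def run(force: bool) -> Tuple[bytes, int]:
    """Return the content served on the second request and the number of measurements."""
    served = ToyFuseFile([b"initial", b"altered"])
    measurer = ToyMeasurer(force=force)
    measurer.open("/mnt/fuse/tool", served)
    second = measurer.open("/mnt/fuse/tool", served)
    return second, len(measurer.log)


print(run(force=False))  # altered content served, measured only once
print(run(force=True))   # altered content served, measured twice
```

Without the force switch, the altered content is served but never appears in the measurement log.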

At the time, there was a patch in discussion attempting to fix the issue with a “force” option in IMA that would force IMA to always re-measure, re-appraise, and re-audit files based on a policy, for example all files served by FUSE. We tested the effectiveness of that patch.

We sent several patch sets to explore different options.

The first option explored was to patch IMA so that it recognises FUSE filesystems as special and uses the “force” option on them, so the measurements are performed for each request, even if the kernel didn’t detect any changes. Reviewers pointed out that the IMA subsystem should not have to know about the different filesystems in order to behave differently in the case of FUSE. Indeed, other filesystems, such as remote filesystems, could have the same issue. With this option, IMA needs to know about the behaviour of all filesystems, which puts this knowledge at the wrong layer.

In order to solve this, the second option explored was to introduce a new filesystem flag FS_NO_IMA_CACHE (v1, v2, v3, v4) that allows filesystems to announce to the IMA subsystem their own behaviour with regard to caching. The IMA subsystem can then check the flag and use the “force” option if the flag is present. In that way, the IMA subsystem does not need to know about the different filesystems.

However, the IMA “force” option does not solve all the issues and signatures still can’t be verified meaningfully. In the end, the following solution was implemented in Linux 4.17 by IMA maintainers:

Add a new flag SB_I_IMA_UNVERIFIABLE_SIGNATURE that filesystems can use to announce to IMA that their signatures cannot be verified.
Right now, only FUSE filesystems make use of this flag. For more security, IMA users can use the IMA policy “fail_securely” that makes IMA signature verification on FUSE filesystems fail even without a user namespace.
Add another new flag SB_I_UNTRUSTED_MOUNTER that filesystems can use when they are mounted from a user namespace that cannot be trusted. In that case, the IMA measurement would fail.

After this IMA fix in Linux 4.17, FUSE filesystems in a non-initial user namespace are finally allowed in Linux 4.18.

Bringing user namespaces to Kubernetes

Although user namespaces are supported in OCI container runtimes like runc, this feature is not available in Kubernetes. Efforts to add user namespace support to Kubernetes date back to 2016, starting with this enhancement issue, but have so far been unsuccessful.

At the time of this first attempt, Kubernetes had already introduced the Container Runtime Interface (CRI) but it was not yet the default. Nowadays, however, the Kubelet communicates with the container runtime via the CRI’s gRPC interface, which makes the CRI the main target for introducing support for user namespaces.

comparison of kubelet: directly to docker; to the CRI; to containerd via CRI

The diagram above shows different architectures for the Kubelet’s communication with the container runtimes:

The old way: the Kubelet talks to Docker directly to start containers using the Docker API (deprecated)
The new way: the Kubelet talks to the container runtime via the CRI’s gRPC interface. The container runtime might have a “CRI shim” component that understands the CRI protocol.
The new way when using containerd as the container runtime: containerd is compiled with containerd/cri to understand the CRI protocol. It receives CRI commands from the Kubelet and starts containers using runc and the OCI runtime spec.

CRI changes for user namespaces

The exact changes in the CRI to support user namespaces are still in discussion. But the general idea is the following:

The Kubelet asks the container runtime to start the “pod” sandbox with the RunPodSandbox() method. This will create a “sandbox” container (also known as the “infrastructure” container) that will hold the shared namespaces among the different containers of the pod. The user namespace will be configured at this stage with specific uid and gid mappings, as you can see in the OCI runtime configuration given to runc.

gRPC struct changes needed for User Namespaces

Then, the Kubelet will call the method CreateContainer() for each container of the pod with a reference to the sandbox previously created. The OCI runtime configuration will specify to reuse the user namespace of the sandbox.

gRPC struct showing path to userns reference

In this way, the user namespace will be shared by the different containers in the pod.
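As a rough sketch of what the runtime could hand to runc in these two steps, the helpers below build the relevant part of the "linux" section of an OCI runtime configuration. The field names (uidMappings, gidMappings, containerID, hostID, size, and namespaces entries with type and path) come from the OCI runtime spec; the helper names, the host ID 100000 and the sandbox PID 4242 are assumptions for illustration, and the CRI fields that would carry this information are still under discussion.

```python
import json
from typing import Any, Dict


def sandbox_linux_config(host_id_start: int, size: int = 65536) -> Dict[str, Any]:
    """Sandbox container: create a new user namespace with explicit ID mappings."""
    return {
        "uidMappings": [{"containerID": 0, "hostID": host_id_start, "size": size}],
        "gidMappings": [{"containerID": 0, "hostID": host_id_start, "size": size}],
        "namespaces": [
            {"type": "user"},     # no path: a new namespace is created
            {"type": "ipc"},
            {"type": "network"},
        ],
    }


def container_linux_config(sandbox_pid: int) -> Dict[str, Any]:
    """Pod container: join the namespaces already held by the sandbox."""
    ns_dir = f"/proc/{sandbox_pid}/ns"
    return {
        "namespaces": [
            {"type": "user", "path": f"{ns_dir}/user"},
            {"type": "ipc", "path": f"{ns_dir}/ipc"},
            {"type": "network", "path": f"{ns_dir}/net"},
            {"type": "mount"},    # each container keeps its own mount namespace
        ],
    }


sandbox = sandbox_linux_config(100000)
container = container_linux_config(sandbox_pid=4242)
print(json.dumps({"linux": sandbox}, indent=2))
print(json.dumps({"linux": container}, indent=2))
```

The ID mappings appear only in the sandbox configuration; the other containers reference the existing user namespace by path instead of creating their own.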

The IPC and network namespaces are also shared at the pod level, allowing the different containers to communicate with each other using IPC mechanisms and the network loopback interface respectively. The IPC and network namespaces are owned by the user namespace of the pod, allowing the capabilities of those namespace types to be effective in the pod but not on the host. For example, if a container is given CAP_NET_ADMIN in the user namespace, it will be allowed to configure the network of the pod but not of the host. If a container is given CAP_IPC_OWNER, it can bypass the permissions of IPC objects in the pod but not on the host.

The mount namespace in each container is also owned by the pod user namespace. Thus, if a container is given CAP_SYS_ADMIN, it will be able to perform mounts in its mount namespace but that capability will not be effective for the host mount namespace because the host mount namespace is not owned by the user namespace of the pod.

diagram of userns, mountns, ipcns, and netns inside a k8s pod

Kubernetes volumes

A big challenge for user namespaces in Kubernetes is support for volumes. I mentioned in the introduction that different containers should ideally have different sets of user IDs so that they have better isolation from each other. But it introduces a problem when the different containers need to access the same volumes.

Consider the scenario below:

Container1 writes files on an NFS share. The files belong to the user ID 100000 because that’s the user mapped in the container.
Container2 reads the files from the NFS share. Since the user ID 100000 is not mapped in container 2, the files are seen as belonging to the pseudo user id 65534, special code for “nobody”. This can introduce a file access permission problem.
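The scenario can be sketched by translating a file's host owner back into each container's view. The mapping tuples follow the /proc/<pid>/uid_map column order (inside, outside, length). The function name and the range for container 2 are assumptions; 65534 is the default overflow ID, which the kernel exposes in /proc/sys/kernel/overflowuid.

```python
from typing import List, Tuple

OVERFLOW_ID = 65534  # default "nobody" ID shown for unmapped owners

Mapping = Tuple[int, int, int]  # (inside_start, outside_start, length)


def seen_from_container(mappings: List[Mapping], host_id: int) -> int:
    """Return the owner ID a container sees for a file owned by host_id."""
    for inside_start, outside_start, length in mappings:
        if outside_start <= host_id < outside_start + length:
            return inside_start + (host_id - outside_start)
    return OVERFLOW_ID


container1 = [(0, 100000, 65536)]
container2 = [(0, 200000, 65536)]  # assumed range, distinct from container 1

file_owner_on_host = 100000  # written by "root" of container 1

print(seen_from_container(container1, file_owner_on_host))  # owner seen in container 1
print(seen_from_container(container2, file_owner_on_host))  # owner seen in container 2
```

The same lookup explains the runc case discussed earlier: a binary owned by host uid 0 is unmapped in both containers and therefore shows up as owned by 65534.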

NFS mixing with User Namespace

There are different possibilities to address this problem:

Use the same user id mapping for all pods. This reduces the isolation between containers, although this would still provide better security than the status quo without user namespaces. The files on the volume would be owned by a user id such as 100000 and the administrator would need to take care that the user id mapping in the Kubernetes configuration does not change during the lifetime of the volume.
Use different user id mappings for each pod but use an additional mechanism to convert the user id of files on the fly. There are different kernel mechanisms in the works providing that: shiftfs, fsid mappings or the new mount API in the Linux kernel. But so far, none of those solutions are ready in Linux upstream.
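The difference between the two possibilities comes down to whether the host ID ranges of two pods overlap. The sketch below is purely illustrative: the base ID, the range size and the function names are assumptions, and it does not reflect how Kubernetes or any runtime allocates ranges.

```python
from typing import Tuple

BASE_HOST_ID = 100000  # assumed start of the range reserved for pods
RANGE_SIZE = 65536     # assumed number of IDs per pod

Mapping = Tuple[int, int, int]  # (inside_start, outside_start, length)


def shared_mapping(pod_index: int) -> Mapping:
    """First possibility: every pod gets the same host ID range."""
    return (0, BASE_HOST_ID, RANGE_SIZE)


def per_pod_mapping(pod_index: int) -> Mapping:
    """Second possibility: every pod gets its own, disjoint host ID range."""
    return (0, BASE_HOST_ID + pod_index * RANGE_SIZE, RANGE_SIZE)


def host_ranges_overlap(a: Mapping, b: Mapping) -> bool:
    """True if two mappings cover at least one common host ID."""
    return a[1] < b[1] + b[2] and b[1] < a[1] + a[2]


print(host_ranges_overlap(shared_mapping(0), shared_mapping(1)))
print(host_ranges_overlap(per_pod_mapping(0), per_pod_mapping(1)))
```

With overlapping ranges, files written by one pod keep a meaningful owner in another pod, at the cost of weaker isolation; with disjoint ranges, isolation is stronger but shared volumes need one of the ID-shifting mechanisms mentioned above.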

Our current proposal makes it possible to use the first possibility and it could be extended later when a better solution arises.

Note that there are plenty of workloads without volumes that would benefit from user namespaces today, so not having a complete solution today should not block us from incrementally implementing user namespaces in Kubernetes.

Conclusion

User namespaces are a Linux primitive that provides an additional layer of security to containers. They have proven to be a useful mitigation for several past vulnerabilities. Although they still suffer from some shortcomings with volume support, Linux kernel development is active in that area. I expect future improvements to build on the progress made over the last few years.

We’re proud to contribute to that community effort in Kubernetes. Once Kubernetes support for user namespaces is complete, Kubernetes will gain better security isolation between container workloads and the Linux hosts. It will also open the door to new use cases of running containers with more privileges, something that is too dangerous to do today without user namespace support.
