Container runtimes like Docker and containerd hand off the task of running a container to runc by default. The runc 1.1.0 release includes a feature we have been working on: support for seccomp notify. This new feature can be really useful in some specific use cases, particularly for container deployments where you are trying to tightly lock down what a container can do.
For context, seccomp is a Linux kernel feature available since 2.6.12 that allows a Linux system administrator or application developer to decide which syscalls a task can execute. It is typically used to reduce the attack surface by blocking syscalls the task does not need. The list of allowed and denied syscalls, however, is static and defined before the task launches.
Seccomp notify is a new Linux feature that provides a way to call into a user space program (the seccomp agent) when certain syscalls are executed. The agent can then decide to do one of three things:
- Allow the syscall to proceed and be executed by the calling application (regular kernel permissions checks are performed).
- Block the execution of the syscall and return an error code to the calling application instead. This allows the agent to enforce security policies without the limitations imposed by having to run code in the kernel context (e.g., the agent can make API calls to gather info about the task).
- Execute any arbitrary function, including executing the system call the calling application requested, and synthesize a return code to the calling application. This can be useful when the calling application cannot run the syscall but the seccomp agent can (e.g., for security reasons, like the calling application is not running with CAP_SYS_ADMIN, or because it is running inside a user namespace, etc.).
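To make the three outcomes concrete, here is a minimal sketch of how an agent might choose between them. It is not the API of any real library: `Notification` and `Response` are simplified stand-ins for the kernel's `struct seccomp_notif` and `struct seccomp_notif_resp`, and the policy itself (continue `mkdir`, emulate `mount`, deny everything else) is purely hypothetical.

```python
import errno
from dataclasses import dataclass

# Value of SECCOMP_USER_NOTIF_FLAG_CONTINUE in <linux/seccomp.h>.
SECCOMP_USER_NOTIF_FLAG_CONTINUE = 1


@dataclass
class Notification:
    """Simplified stand-in for struct seccomp_notif (illustration only)."""
    id: int        # cookie identifying this notification
    pid: int       # task that made the syscall
    syscall: str   # syscall name (the kernel reports a number instead)
    args: tuple    # raw syscall arguments


@dataclass
class Response:
    """Simplified stand-in for struct seccomp_notif_resp (illustration only)."""
    id: int
    val: int = 0    # return value reported to the calling application
    error: int = 0  # negated errno, or 0 for success
    flags: int = 0


def decide(notif, emulate_mount):
    """Hypothetical policy picking one of the three outcomes."""
    if notif.syscall == "mkdir":
        # Outcome 1: let the kernel run the syscall in the caller's context.
        # val and error stay 0 when the continue flag is set.
        return Response(notif.id, flags=SECCOMP_USER_NOTIF_FLAG_CONTINUE)
    if notif.syscall == "mount":
        # Outcome 3: the agent does the work and synthesizes the return value.
        return Response(notif.id, val=emulate_mount(notif))
    # Outcome 2: block the syscall and hand back an error code.
    return Response(notif.id, error=-errno.EPERM)
```

A real agent would receive notifications and send responses through the seccomp file descriptor; the sketch only covers the decision step in between.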
While seccomp notify is a generic Linux capability, we are particularly interested in its application to container environments and we will cover a few of those possibilities in this blog post.
Applications of seccomp notify for containers
There are several ways that seccomp notify can be useful in the context of running containerized applications. Here are a few examples.
While this has not been done yet, seccomp notify with runc 1.1 can be used to run Docker inside Docker. Our friends at Gitpod had to do some manual work that is no longer needed thanks to this new feature.
Writing a seccomp notify agent for containers
If you want to build your own agent to process seccomp notifications for containers, you will have to familiarize yourself with some low-level details to see how to read parameters of syscalls, execute syscalls using the container namespaces, etc.
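As one example of those low-level details, pointer arguments (such as the path passed to `mkdir`) arrive as addresses in the calling task's memory, so an agent typically reads them through `/proc/<pid>/mem`. The sketch below shows only the string-extraction step. The helper name `read_cstring` is ours, and it takes any seekable binary file so that it can be tried out with an in-memory buffer.

```python
import io


def read_cstring(mem, addr, max_len=4096):
    """Read a NUL-terminated string starting at addr.

    mem is a seekable binary file, for example an unbuffered
    open("/proc/<pid>/mem", "rb", buffering=0) in a real agent.
    """
    mem.seek(addr)
    data = mem.read(max_len)
    end = data.find(b"\0")
    if end == -1:
        raise ValueError("no NUL terminator within max_len bytes")
    return data[:end].decode("utf-8", errors="replace")


# Stand-in for a task's memory: the string starts at offset 4.
fake_memory = io.BytesIO(b"\xff\xff\xff\xff/tmp/dir\0trailing")
print(read_cstring(fake_memory, 4))
```

Against a real process, the read can come back short or fail if it runs into an unmapped page, and the agent should confirm that the notification is still valid before trusting what it read, since the target task may have exited and its PID may have been reused.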
To make this simpler, we have created a generic application, the Kinvolk Seccomp Agent, that allows you to write your own logic and leverage the packages we wrote to read syscall parameters, execute code in another namespace, and do other things almost all agents might need. The agent is in its early days, but we think it is a great starting point for anyone looking to leverage the new seccomp notify capabilities, and we are happy to take patches that you might develop as you build your own agent based on it.
Before you jump into implementing your own agent, though, there are some important subtleties that you should make sure you understand. For example, it is not safe to create a dynamic user space seccomp policy agent in a generic way. This is explained in depth in this blog post by Christian Brauner. Please check it out.
If you are using Flatcar Container Linux and containerd as your container runtime, you can use seccomp notify with version 3127.0.0 or later (if you use Docker as your container runtime, the currently shipped version does not yet have this functionality).
If you are not using Flatcar Container Linux, you will need to make sure your Linux environment includes the following:
- runc >= 1.1.0
- libseccomp >= 2.5.0 (>= 2.5.2 recommended)
- Linux kernel >= 5.9
- Docker from git (needs to include this PR)
- Or, if you are using containerd instead of Docker, containerd >= 1.5.5 (>= 1.6.0-rc.1 recommended)
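If you want to check these minimums from a script, a small version comparison is enough. The sketch below is only an illustration: the helper names are ours, it compares the leading numeric components only, and it ignores pre-release suffixes such as `-rc.1`.

```python
import re


def parse_version(text):
    """Return the leading dotted numeric part of text as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found, minimum):
    """True if found is at least minimum, comparing numeric parts only."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "5.9" and "5.9.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

For the kernel requirement, for instance, you could call `meets_minimum(platform.release(), "5.9")` after `import platform`. How you obtain the runc, libseccomp, and containerd version strings depends on your distribution.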
As we contributed seccomp notify support to the OCI runtime-spec upstream, this change was also implemented by other OCI-compatible runtimes like crun and youki. We have not tested it with them, but it should just work.
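For reference, here is a rough sketch of what the seccomp section of an OCI runtime config could look like when it asks for notifications. The field names (`defaultAction`, `listenerPath`, `listenerMetadata`, `syscalls`) and the `SCMP_ACT_NOTIFY` action come from the runtime-spec. The socket path, the choice of `mkdir`, and the helper name `notify_profile` are assumptions made up for this example.

```python
import json


def notify_profile(syscalls, listener_path, metadata=""):
    """Build a linux.seccomp section that forwards the given syscalls
    to an agent listening on listener_path (a UNIX socket)."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "listenerPath": listener_path,
        "listenerMetadata": metadata,
        "syscalls": [
            {"names": list(syscalls), "action": "SCMP_ACT_NOTIFY"},
        ],
    }


# Hypothetical socket path; use whatever path your agent listens on.
profile = notify_profile(["mkdir"], "/run/seccomp-agent.socket")
print(json.dumps(profile, indent=2))
```

Higher-level runtimes may expose this through their own profile formats, so treat the output as the shape of the runtime-spec section rather than something to paste in unchanged.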
Seccomp notify is a powerful new Linux kernel feature that enables developers and administrators to write user space agents which can filter syscall requests, typically to enforce security rules or execute a privileged operation on behalf of an unprivileged process. It is increasingly supported in container runtimes and orchestrators, and the Kinvolk Seccomp Agent is a great place to get started if you want to delve into the topic more deeply by writing your own agent.
If the above sounds like it might be of interest and you want to learn more, check out this presentation from my colleague Alban:
- Seccomp Notify on Kubernetes: The new Linux superpower coming to a Kubernetes cluster near you! - FOSDEM 2021
Also, if you want to dive deeper into the seccomp notify kernel feature and its usage in other container runtimes, check out these blog posts:
- Playing with seccomp notifications in the OCI runtime - Giuseppe Scrivano
- The Seccomp Notifier - New Frontiers in Unprivileged Container Development - Christian Brauner (the blog post mentioned above)