systemd is a system and service manager for Linux. It offers a set of security features for sandboxing services in order to limit the set of system resources a service can access. Some of these features include limiting access to resources like memory and CPU, limiting the syscalls that can be used and so on. In this post we’ll show how eBPF is currently used in systemd to implement some of those security features and how supporting libbpf has opened the door to adding new features. We’ll provide details about two of these new features that use eBPF, show how they can be used, and dig into the implementation details.

eBPF makes it possible to modify the Linux kernel behaviour – without recompiling it – by loading a user-defined program into it. Those programs are executed upon different kernel events (known as hooks) and based on their return value the kernel modifies its behaviour. For example, an eBPF program could choose to drop a network packet, refuse a system call or record the event for tracing purposes.

systemd uses eBPF for IP filtering and accounting, and it also supports custom eBPF programs to filter ingress and egress traffic. The IP filtering programs are written directly in eBPF assembly, which makes them efficient but also harder to maintain, and makes implementing new related functionality unnecessarily difficult. Fortunately, systemd has work-in-progress code to support eBPF programs written in pseudo-C and loaded with libbpf.

Supporting eBPF programs written in pseudo-C is a game changer for eBPF in systemd. This support will speed up the creation of new eBPF-based features as many developers are familiar with C but not with eBPF assembly. Those features will also be easier to maintain. Using a proven library for loading and managing the eBPF objects (maps, programs and so on) like libbpf allows systemd to use advanced features like Compile Once - Run Everywhere, a technology based on BTF (BPF Type Format) that allows running programs on different hosts without worrying about internal changes in kernel structures that could break them.

We’ve implemented two new properties based on eBPF: RestrictFileSystems= and RestrictNetworkInterfaces=.

Note: our work is still in review and the implementation is subject to change based on review comments. We’ll update this blog post when they’re merged.

RestrictFileSystems=

RestrictFileSystems= allows limiting the filesystem types processes in a systemd service have access to. For example, if a systemd service specifies “RestrictFileSystems=ext4 tmpfs”, processes belonging to that service can only access files living in an ext4 or tmpfs filesystem. A deny-list approach is also supported. For example, processes in a systemd unit specifying “RestrictFileSystems=~tracefs” cannot access files on tracefs filesystems. This feature adds an extra layer of security preventing processes from accessing security sensitive filesystems like debugfs or tracefs even if they are running as root.

How to Use

This feature requires a kernel >= 5.7 configured with CONFIG_BPF_LSM, CONFIG_DEBUG_INFO_BTF and the BPF LSM enabled (via CONFIG_LSM="...,bpf" or the "lsm=" kernel boot parameter). The system must also use cgroup2 (unified or hybrid) and libbpf >= 0.2.0 must be installed. The BPF LSM is very new and is only enabled in the kernels of some popular distributions. If you want to use this feature before it lands in your distribution, you’ll have to compile your own kernel with those configuration options enabled.
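
A quick way to check whether the BPF LSM is active on a running system is to look at the list of enabled LSMs. The following is a minimal sketch, assuming securityfs is mounted and exposes that list as a comma-separated string in /sys/kernel/security/lsm. The function names are ours, chosen for illustration, and are not part of systemd.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `name` appears as a whole entry in a comma-separated
 * LSM list such as "lockdown,capability,yama,apparmor,bpf". */
bool lsm_list_contains(const char *list, const char *name)
{
        size_t name_len = strlen(name);
        const char *p = list;

        while (*p != '\0') {
                size_t len = strcspn(p, ",\n");

                if (len == name_len && strncmp(p, name, len) == 0)
                        return true;
                p += len;
                if (*p != '\0')
                        p++;    /* skip the separator */
        }
        return false;
}

/* Read the active LSM list (assumed path) and look for "bpf".
 * Returns 1 if enabled, 0 if not, -1 if the list cannot be read. */
int bpf_lsm_enabled(void)
{
        char buf[256];
        FILE *f = fopen("/sys/kernel/security/lsm", "r");

        if (f == NULL)
                return -1;
        if (fgets(buf, sizeof(buf), f) == NULL) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return lsm_list_contains(buf, "bpf") ? 1 : 0;
}
```

Matching whole entries rather than substrings avoids false positives if another LSM name happens to contain "bpf".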

In the following example we’ll use three different paths with different filesystems. Let’s look at the (trimmed) output of the mount command:

$ mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
/dev/sda1 on / type ext4 (rw,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,nr_inodes=409600)

Let’s start with a unit file that can only access ext4 filesystems and tries to access a file on /tmp (tmpfs).

$ echo "This is my file" > /tmp/myfile
$ cat /lib/systemd/system/myservice.service
[Unit]
Description=My test unit
[Service]
ExecStart=cat /tmp/myfile
RestrictFileSystems=ext4
$ systemctl start myservice
$ journalctl -u myservice
Feb 01 19:20:46 ubuntu-focal systemd[1]: Started My test unit.
Feb 01 19:20:47 ubuntu-focal cat[9672]: cat: /tmp/myfile: Operation not permitted
Feb 01 19:20:47 ubuntu-focal systemd[1]: myservice.service: Main process exited, code=exited, status=1/FAILURE
Feb 01 19:20:47 ubuntu-focal systemd[1]: myservice.service: Failed with result 'exit-code'.
$ systemctl stop myservice

We can see that the unit fails because the cat process inside the unit is not able to access the /tmp/myfile file. Let’s add permissions for the tmpfs filesystem.

$ sudo cat /lib/systemd/system/myservice.service
[Unit]
Description=My test unit
[Service]
ExecStart=cat /tmp/myfile
RestrictFileSystems=ext4 tmpfs
$ systemctl start myservice
$ journalctl -u myservice
Feb 01 19:26:45 ubuntu-focal systemd[1]: Started My test unit.
Feb 01 19:26:45 ubuntu-focal cat[10077]: This is my file
Feb 01 19:26:45 ubuntu-focal systemd[1]: myservice.service: Succeeded.

As expected, the cat process inside the service is now able to access the file. This example may not be that interesting, so let’s try another one where we forbid access to a dangerous kind of filesystem, like proc.

$ sudo cat /lib/systemd/system/myservice.service
[Unit]
Description=My test unit
[Service]
ExecStart=cat /proc/kcore
RestrictFileSystems=~proc
$ systemctl start myservice
$ journalctl -u myservice
Feb 01 19:47:29 ubuntu-focal systemd[1]: Started My test unit.
Feb 01 19:47:29 ubuntu-focal cat[11284]: cat: /proc/kcore: Operation not permitted
Feb 01 19:47:29 ubuntu-focal systemd[1]: myservice.service: Main process exited, code=exited, status=1/FAILURE
Feb 01 19:47:29 ubuntu-focal systemd[1]: myservice.service: Failed with result 'exit-code'.

The systemd.exec man page will offer more information on the RestrictFileSystems= usage.

Implementation

This feature is implemented by attaching an eBPF program (BPF_PROG_TYPE_LSM) to the file_open BPF LSM hook (BPF_LSM_MAC). The program is attached at boot time and stays attached until shutdown. When a service specifying the RestrictFileSystems= property is started, an entry is added to a global BPF hashmap pinned to the BPF filesystem under /sys/fs/bpf/systemd/lsm_bpf_map. The map stores a set of filesystem magic numbers per cgroup ID. When a process tries to open a file, the BPF program is executed and checks which cgroup the process is running in. If an entry for that cgroup is present in the global map, the program checks whether the magic number of the filesystem being accessed is in the set and, depending on the policy (allow-list or deny-list), allows or denies access.
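
To make the notion of a filesystem magic number more concrete, here is a small user-space sketch. It assumes the magic values defined in <linux/magic.h> (copied here as literals) and uses statfs(2) to find the magic number of the filesystem holding a path. The table and the function names are hypothetical helpers for illustration, not the code systemd uses.

```c
#include <stdint.h>
#include <string.h>
#include <sys/vfs.h>

/* A few magic numbers, with the values defined in <linux/magic.h>. */
struct fs_magic {
        const char *name;
        uint32_t magic;
};

static const struct fs_magic known_fs[] = {
        { "ext4",    0xEF53 },
        { "tmpfs",   0x01021994 },
        { "proc",    0x9fa0 },
        { "debugfs", 0x64626720 },
        { "tracefs", 0x74726163 },
};

/* Translate a filesystem name to its magic number; returns 0 if the
 * name is not in our (deliberately short) table. */
uint32_t fs_name_to_magic(const char *name)
{
        for (size_t i = 0; i < sizeof(known_fs) / sizeof(known_fs[0]); i++)
                if (strcmp(known_fs[i].name, name) == 0)
                        return known_fs[i].magic;
        return 0;
}

/* Store the magic number of the filesystem holding `path` in *magic.
 * Returns 0 on success, -1 on error (errno is set by statfs). */
int path_fs_magic(const char *path, uint32_t *magic)
{
        struct statfs st;

        if (statfs(path, &st) < 0)
                return -1;
        *magic = (uint32_t) st.f_type;
        return 0;
}
```

On the system from the example above, we would expect path_fs_magic("/tmp/myfile", ...) to yield the tmpfs value, which is the number the BPF program compares against the set stored for the cgroup.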

The eBPF program receives as argument a struct file *, which contains a struct inode *, which in turn contains a struct super_block * that finally holds the superblock magic number. Those structures are internal to the kernel and can change from version to version, i.e. the offset of those fields within each structure could change, so a program compiled for a given kernel version might not work on a different one. The BPF Type Format (BTF) infrastructure offers a solution to this problem. The structures are defined in the eBPF program with the __attribute__((preserve_access_index)) clang attribute, which makes the compiler record relocations for accesses to those fields so that libbpf can adjust the offsets at load time. This mechanism is also known as Compile Once - Run Everywhere; more details can be found in the BPF Portability and CO-RE blog post.
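
The following user-space analogy illustrates why resolving offsets at load time matters. It is only a sketch of the idea, not how libbpf implements relocations: the two struct layouts and the function name are hypothetical, standing in for the same kernel structure on two different kernel versions.

```c
#include <stddef.h>
#include <string.h>

/* Two hypothetical layouts of the same kernel structure: the field the
 * program needs (s_magic) lives at a different offset in each. */
struct super_block_v1 {
        unsigned long s_magic;
};

struct super_block_v2 {
        void *s_new_field;      /* imagined field added by a later kernel */
        unsigned long s_magic;
};

/* Read s_magic through an offset supplied by the caller, instead of an
 * offset baked in when the reader was compiled. */
unsigned long read_s_magic(const void *sb, size_t s_magic_offset)
{
        unsigned long magic;

        memcpy(&magic, (const char *) sb + s_magic_offset, sizeof(magic));
        return magic;
}
```

A reader compiled against the first layout with a fixed offset of 0 would return the wrong field on the second layout. Passing the offset in, as a loader with type information for the running kernel can do, keeps the same compiled code working on both.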

The following is the full implementation of the eBPF program:

/* SPDX-License-Identifier: GPL-2.0-only */

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct super_block {
        long unsigned int s_magic;
} __attribute__((preserve_access_index));

struct inode {
        struct super_block *i_sb;
} __attribute__((preserve_access_index));

struct file {
        struct inode *f_inode;
} __attribute__((preserve_access_index));

struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 2048);  /* arbitrary */
        __type(key, uint64_t);      /* cgroup ID */
        __type(value, uint32_t);    /* fs magic set */
} cgroup_hash SEC(".maps");

SEC("lsm/file_open")
int BPF_PROG(restrict_filesystems, struct file *file, int ret)
{
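        unsigned long raw_magic_number;
        uint64_t cgroup_id;
        uint32_t *magic_map, magic_number, zero = 0, *is_allow;

        /* ret is the return value from the previous BPF program or 0 if it's
         * the first hook */
        if (ret != 0)
                return ret;

        BPF_CORE_READ_INTO(&raw_magic_number, file, f_inode, i_sb, s_magic);
        /* The inner map is looked up with 32-bit keys (see zero), so narrow
         * the value explicitly before using it as a key */
        magic_number = (uint32_t) raw_magic_number;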

        cgroup_id = bpf_get_current_cgroup_id();

        magic_map = bpf_map_lookup_elem(&cgroup_hash, &cgroup_id);
        if (!magic_map)
                return 0;

        if ((is_allow = bpf_map_lookup_elem(magic_map, &zero)) == NULL) {
                /* Malformed map, it doesn't include whether it's an allow list
                 * or a deny list. Allow. */
                return 0;
        }

        if (*is_allow) {
                /* Allow-list: Allow access only if magic_number present in inner map */
                if (bpf_map_lookup_elem(magic_map, &magic_number) == NULL)
                        return -EPERM;
        } else {
                /* Deny-list: Allow access only if magic_number is not present in inner map */
                if (bpf_map_lookup_elem(magic_map, &magic_number) != NULL)
                        return -EPERM;
        }

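        return 0;
}

/* BPF LSM programs must declare a GPL-compatible license to be loaded */
static const char _license[] SEC("license") = "GPL";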

Full details of the implementation can be found in the GitHub Pull Request #18145 introducing this support.
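
The decision logic of the program above can also be modelled in plain user-space C, which is handy for reasoning about the allow-list and deny-list cases. This is a sketch under our own assumptions: struct fs_policy and check_file_open are hypothetical names that stand in for the inner BPF map of one cgroup, where key 0 holds the mode flag and the other keys are magic numbers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the inner map of one cgroup. */
struct fs_policy {
        bool has_mode;          /* false models a map without key 0 */
        bool is_allow;          /* value stored under key 0 */
        const uint32_t *magics; /* filesystem magic numbers in the set */
        size_t n_magics;
};

static bool policy_contains(const struct fs_policy *p, uint32_t magic)
{
        for (size_t i = 0; i < p->n_magics; i++)
                if (p->magics[i] == magic)
                        return true;
        return false;
}

/* Mirror the checks of the file_open program. A NULL policy means the
 * cgroup has no entry in the outer map. Returns 0 or -EPERM. */
int check_file_open(const struct fs_policy *p, uint32_t magic)
{
        if (p == NULL || !p->has_mode)
                return 0;
        if (p->is_allow)
                return policy_contains(p, magic) ? 0 : -EPERM;
        return policy_contains(p, magic) ? -EPERM : 0;
}
```

With a policy equivalent to RestrictFileSystems=ext4, a file on tmpfs yields -EPERM, matching the "Operation not permitted" error seen in the first example, while a cgroup without any policy is never restricted.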

RestrictNetworkInterfaces=

RestrictNetworkInterfaces= allows limiting the network interfaces that processes in a systemd service have access to. For example, if a systemd service specifies RestrictNetworkInterfaces=eth0 eth1, processes belonging to that service can only use the eth0 and eth1 network interfaces, and not others that might be present on the host (like eth2 or bond0). It also supports a deny-list approach. For example, processes in a systemd service specifying RestrictNetworkInterfaces=~bond0 won’t be able to access bond0.

This increases security as it prevents processes from accessing network interfaces they shouldn’t have access to even if they are privileged processes.

How to Use

In order to use this feature, a kernel >= 5.7 with CONFIG_CGROUP_BPF enabled is required. This version was released almost a year ago and most distributions enable that option by default. A systemd version with the feature merged, eBPF support enabled and libbpf >= 0.2.0 installed on the host are the other requirements to make it work.

Before setting up the unit, let’s check which interfaces exist and which one is used to reach the internet.

$ ip link ls up
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
	link/ether 02:7c:95:13:d4:48 brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
	link/ether 08:00:27:29:7e:64 brd ff:ff:ff:ff:ff:ff

$ ip route
default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100
192.168.33.0/24 dev enp0s8 proto kernel scope link src 192.168.33.10

We can see that the system has three network interfaces: lo (loopback), enp0s3 used to connect to the internet, and enp0s8 to reach the 192.168.33.0/24 network.

Let’s start a process inside a temporary unit that can only access the enp0s3 interface:

$ sudo systemd-run -t -p RestrictNetworkInterfaces=enp0s3 ping 8.8.8.8
Running as unit: run-u11.service
Press ^] three times within 1s to disconnect TTY.
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=63 time=18.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=63 time=32.2 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=63 time=21.5 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=63 time=19.4 ms
^C
--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 18.318/22.848/32.177/5.504 ms

As expected, ping works successfully. Let’s try again giving access to the enp0s8 interface only.

$ sudo systemd-run -t -p RestrictNetworkInterfaces=enp0s8 ping 8.8.8.8
Running as unit: run-u12.service
Press ^] three times within 1s to disconnect TTY.
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
/usr/bin/ping: sendmsg: Operation not permitted
/usr/bin/ping: sendmsg: Operation not permitted
/usr/bin/ping: sendmsg: Operation not permitted
/usr/bin/ping: sendmsg: Operation not permitted
^C
--- 8.8.8.8 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3256ms

In this case, the ping command doesn’t work because the packets are routed through the enp0s3 interface, which is forbidden.

There is no special exception for the loopback interface: you have to explicitly allow or deny access to it.

For instance, the lo interface is typically needed for DNS resolution.

$ sudo systemd-run -t -p RestrictNetworkInterfaces="enp0s3" ping kinvolk.io
Running as unit: run-u24.service
Press ^] three times within 1s to disconnect TTY.
/usr/bin/ping: kinvolk.io: Temporary failure in name resolution

$ sudo systemd-run -t -p RestrictNetworkInterfaces="enp0s3 lo" ping kinvolk.io
Running as unit: run-u25.service
Press ^] three times within 1s to disconnect TTY.
PING kinvolk.io (172.67.196.142) 56(84) bytes of data.
64 bytes from 172.67.196.142 (172.67.196.142): icmp_seq=1 ttl=63 time=43.7 ms
64 bytes from 172.67.196.142 (172.67.196.142): icmp_seq=2 ttl=63 time=44.6 ms
^C
--- kinvolk.io ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1013ms
rtt min/avg/max/mdev = 43.731/44.183/44.636/0.452 ms

The systemd.resource-control man page will provide more information on the RestrictNetworkInterfaces= usage.

Implementation

This feature uses two eBPF programs (BPF_PROG_TYPE_CGROUP_SKB), attached to ingress (BPF_CGROUP_INET_INGRESS) and egress (BPF_CGROUP_INET_EGRESS). These programs are executed each time a process in the cgroup tries to send or receive a network packet. If the program returns 1 the packet is forwarded, otherwise it’s dropped. The argument of those programs is a struct __sk_buff *, which contains the ifindex field indicating the index of the network interface where the packet is being sent/received.

An eBPF hashmap contains the list of indexes of the network interfaces to deny/allow. The key size is 4 as the interface index is 4 bytes long. The size of the value doesn’t matter as the map is used as a set. The size of the map is set in the control code according to the number of interfaces in the rules before loading the eBPF objects.
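
Since the map is keyed by interface index while the unit file names interfaces, the control code has to translate names into indexes. A minimal sketch of that translation, using the standard if_nametoindex(3) call, could look like the following. The resolve_ifindexes function is a hypothetical helper for illustration and is not taken from the systemd implementation.

```c
#include <net/if.h>
#include <stddef.h>
#include <stdint.h>

/* Resolve interface names to kernel interface indexes, i.e. the 4-byte
 * values that would be used as map keys. Returns the number of indexes
 * written to `out`, or -1 if a name does not match an existing
 * interface. */
int resolve_ifindexes(const char *const *names, size_t n, uint32_t *out)
{
        for (size_t i = 0; i < n; i++) {
                unsigned int idx = if_nametoindex(names[i]);

                if (idx == 0)   /* 0 is never a valid interface index */
                        return -1;
                out[i] = idx;
        }
        return (int) n;
}
```

On the host from the example above, resolving "lo" and "enp0s3" would be expected to give 1 and 2, the numbers shown at the start of each entry in the ip link output.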

The is_allow_list global variable defines whether the map represents the list of network interfaces to allow or deny. There is a lookup using the ifindex member of the struct __sk_buff, then based on whether it’s an allow or deny list, the packet is dropped or forwarded.

The common part of the code for both programs is abstracted in a single function, then both programs are defined. SEC("cgroup_skb/egress") and SEC("cgroup_skb/ingress") allow libbpf to infer the program and attach types, which saves us from specifying them again in the control code.

The following is the full implementation of the eBPF part:

/* SPDX-License-Identifier: LGPL-2.1-or-later */

/* <linux/bpf.h> must precede <bpf/bpf_helpers.h> due to integer types
 * in bpf helpers signatures.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

const volatile __u8 is_allow_list = 0;

/* Map containing the network interfaces indexes.
 * The interpretation of the map depends on the value of is_allow_list.
 */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __type(key, __u32);
        __type(value, __u8);
} ifaces_map SEC(".maps");

#define DROP 0
#define PASS 1

static inline int restrict_network_interfaces_impl(struct __sk_buff *sk) {
        __u32 ifindex;
        __u8 *lookup_result;

        ifindex = sk->ifindex;
        lookup_result = bpf_map_lookup_elem(&ifaces_map, &ifindex);
        if (is_allow_list) {
                /* allow-list: let the packet pass if iface in the list */
                if (lookup_result)
                        return PASS;
        } else {
                /* deny-list: let the packet pass if iface *not* in the list */
                if (!lookup_result)
                        return PASS;
        }

        return DROP;
}

SEC("cgroup_skb/egress")
int restrict_network_interfaces_egress(struct __sk_buff *sk)
{
        return restrict_network_interfaces_impl(sk);
}

SEC("cgroup_skb/ingress")
int restrict_network_interfaces_ingress(struct __sk_buff *sk)
{
        return restrict_network_interfaces_impl(sk);
}

char _license[] SEC("license") = "LGPL-2.1-or-later";

The full set of changes in the implementation and the associated eBPF code can be found in the GitHub Pull Request #18385.

Dependencies Summary

The following table summarizes the minimal dependencies needed to use the two new properties.

Property                    systemd  Kernel  Kernel options          libbpf
RestrictFileSystems=        v249*    5.7     CONFIG_BPF              v0.2.0
                                             CONFIG_BPF_SYSCALL
                                             CONFIG_BPF_LSM
                                             CONFIG_DEBUG_INFO_BTF
                                             CONFIG_LSM="…,bpf"
RestrictNetworkInterfaces=  v249*    5.7     CONFIG_BPF              v0.2.0
                                             CONFIG_BPF_SYSCALL
                                             CONFIG_CGROUP_BPF

* Note: we hope to get the PRs merged and have this functionality available in v249.

Future Work

The introduction of support for eBPF programs written in pseudo-C in systemd opens the door to many new uses of eBPF there. A near-term step could be to port the existing security features (IPAddressDeny=, IPAccounting=, IPAddressAllow=) to this approach in order to make them easier to maintain and extend. Another idea is to implement a full firewall solution that is able to filter by more L4 fields like ports, protocols and flags, and not only by the L3 address like the current solution.

We would like to thank Lennart Poettering, who provided valuable feedback on the implementation and blog post.
