Just before the All Systems Go! conference, we had a BPF Hackfest at the Kinvolk office and one of the topics of discussion was to document different BPF ELF loaders. This blog post is the result of it.
BPF is a new technology in the Linux kernel, which allows running custom code attached to kernel functions, network cards, or sockets, among others. Since it is very versatile, a plethora of tools can be used to work with BPF code: perf record, tc from iproute2, libbcc, etc. Each of these tools has a different focus, but they use the same Linux facilities to achieve their goals. This post documents the steps they use to load BPF into the kernel.
BPF is usually compiled from C, using clang, and “linked” into a single ELF file. The exact format of the ELF file depends on the specific tool, but there are some common points. ELF sections are used to distinguish map definitions and executable code. Each code section usually contains a single, fully inlined function.
The loader creates maps from the definitions in the ELF using the
bpf(BPF_MAP_CREATE) syscall and saves the returned file descriptors. This
is where the first complication comes in, because the loader now has to rewrite
all references to a particular map with the file descriptor returned by the
bpf() syscall. It does this by iterating through the symbol and relocation
tables contained in the ELF, which yield an offset into a code section for each
reference. It then patches the instruction at that offset to use the correct fd.
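The fixup described above can be sketched in a few lines. This is a minimal illustration, not the code of any particular loader: `patch_map_fd` is a hypothetical helper, the bytecode is assumed to be little-endian, and the constants are taken from the kernel's `include/uapi/linux/bpf.h`. Map references are emitted as a 64-bit immediate load (`ld_imm64`), and the loader marks the immediate as a map fd by setting the source register field to `BPF_PSEUDO_MAP_FD`.

```python
import struct

# Constants as defined in the kernel's include/uapi/linux/bpf.h.
BPF_LD_IMM64 = 0x18       # BPF_LD | BPF_IMM | BPF_DW: 64-bit immediate load
BPF_PSEUDO_MAP_FD = 1     # src_reg value that marks the immediate as a map fd
INSN_FORMAT = "<BBhi"     # opcode, dst/src registers, offset, immediate (8 bytes)


def patch_map_fd(code: bytes, offset: int, map_fd: int) -> bytes:
    """Return a copy of `code` whose ld_imm64 at `offset` loads `map_fd`.

    Hypothetical helper: `offset` is the byte offset that a relocation
    entry yields for the code section.
    """
    opcode, regs, off, _imm = struct.unpack_from(INSN_FORMAT, code, offset)
    if opcode != BPF_LD_IMM64:
        raise ValueError("relocation at %d does not point at ld_imm64" % offset)
    dst_reg = regs & 0x0F                       # low nibble: destination register
    regs = (BPF_PSEUDO_MAP_FD << 4) | dst_reg   # high nibble: source register
    patched = bytearray(code)
    struct.pack_into(INSN_FORMAT, patched, offset, opcode, regs, off, map_fd)
    return bytes(patched)


# Example: "r1 = <map>" (ld_imm64 occupies two 8-byte slots), then "exit".
code = (
    struct.pack(INSN_FORMAT, BPF_LD_IMM64, 0x01, 0, 0)
    + struct.pack(INSN_FORMAT, 0, 0, 0, 0)
    + struct.pack(INSN_FORMAT, 0x95, 0, 0, 0)
)
patched = patch_map_fd(code, 0, 5)  # pretend bpf(BPF_MAP_CREATE) returned fd 5
print(struct.unpack_from(INSN_FORMAT, patched, 0))  # (24, 17, 0, 5)
```

Only the first slot of the instruction is touched here; the second slot carries the upper 32 bits of the immediate, which stay zero for a file descriptor.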
After this fixup is done, the loader calls
bpf(BPF_PROG_LOAD) with the patched
bytecode. The BPF verifier resolves map fds to the in-kernel data
structures and verifies that the code is using the maps correctly. The kernel
rejects the code if it references invalid file descriptors. This means that the
outcome of BPF_PROG_LOAD depends on the environment of the calling process: the
fds patched into the bytecode must be open in that process.
After the BPF program is successfully loaded, it can be attached to a variety
of kernel subsystems. Some subsystems use a simple syscall (e.g. setsockopt()
with SO_ATTACH_BPF), while others require netlink messages (XDP) or manipulating
tracefs (kprobes, tracepoints).
Small differences between BPF ELF loaders
The different loaders offer different features and for that reason use slightly
different conventions in the ELF file. The ELF conventions are not part of the
Linux ABI. This means that an ELF file prepared for one loader usually cannot
just be loaded by another one. The map definition struct shown in the schema
is the main varying part.
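To see why a differing map definition struct makes ELF files non-portable, consider this rough sketch. The two layouts are illustrative assumptions, loosely modeled on libbpf's `struct bpf_map_def` and iproute2's `struct bpf_elf_map`; the exact fields differ between versions and should be checked against each loader's headers. `parse_map_defs` is a hypothetical helper.

```python
import struct
from collections import namedtuple

# Illustrative layouts only; every field is assumed to be a 32-bit unsigned int.
LAYOUTS = {
    "libbpf-like": ("type", "key_size", "value_size", "max_entries", "map_flags"),
    "tc-like": ("type", "size_key", "size_value", "max_elem", "flags", "id", "pinning"),
}


def parse_map_defs(section: bytes, layout: str):
    """Split the raw bytes of a "maps" section into records of one layout."""
    fields = LAYOUTS[layout]
    record = struct.Struct("<%dI" % len(fields))
    if len(section) % record.size != 0:
        raise ValueError("section size does not match the %s layout" % layout)
    MapDef = namedtuple("MapDef", fields)
    return [MapDef(*values) for values in record.iter_unpack(section)]


# One hash map (type 1): 4-byte keys, 8-byte values, 1024 entries, pinning set to 2.
tc_section = struct.pack("<7I", 1, 4, 8, 1024, 0, 0, 2)
print(parse_map_defs(tc_section, "tc-like"))

# The same bytes do not fit the other layout (28 is not a multiple of 20).
try:
    parse_map_defs(tc_section, "libbpf-like")
except ValueError as err:
    print("rejected:", err)
```

A size mismatch is the friendly failure mode. When two layouts happen to have the same size, a loader would silently misread fields instead, which is one reason each loader only accepts ELF files built for its own conventions.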
| BPF ELF loader \ Features | Maps in maps | Pinning | NUMA node | bpf2bpf function call |
|---|---|---|---|---|
| libbpf (Linux kernel) map def | No | No | Yes (via samples) | Yes |
| Perf map def | No | No | No | Yes |
| iproute2 / tc map def | Yes | Yes (none, object, global) | No | Yes |
| gobpf map def | Not yet | Yes (none, object, global, custom) | No | No |
There are other varying parts in loader ELF conventions that we found noteworthy:
- Some use one ELF section per map, some use one “maps” section for all the maps.
- The naming of the sections and the function entry points varies. Some have default section names that can be overridden on the CLI (tc), some require well-defined prefixes (“kprobe/”, “kretprobe/”).
- Some use csv-style parameters in the section name (perf), some give an API in Go to programmatically change the loader’s behaviour.
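As an illustration of the prefix-based naming convention, a loader might map section names to program kinds roughly like this. The prefix table and the `classify_section` helper are hypothetical; each real loader defines its own set of names.

```python
# Hypothetical prefix table; real loaders each define their own conventions.
SECTION_PREFIXES = {
    "kprobe/": "kprobe",
    "kretprobe/": "kretprobe",
    "tracepoint/": "tracepoint",
}


def classify_section(name: str):
    """Return (program kind, attach target), or None for an unknown prefix."""
    for prefix, kind in SECTION_PREFIXES.items():
        if name.startswith(prefix):
            return kind, name[len(prefix):]
    return None


print(classify_section("kprobe/sys_open"))     # ('kprobe', 'sys_open')
print(classify_section("kretprobe/sys_open"))  # ('kretprobe', 'sys_open')
print(classify_section("maps"))                # None
```

A section that matches no prefix, such as a shared “maps” section, would be handled separately or ignored, depending on the loader.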
BPF is actively developed in the Linux kernel, and whenever a new feature is implemented, BPF ELF loaders might need an update as well to support it. The different BPF ELF loaders have different focuses and might not add support for all new BPF kernel features at the same speed. There are efforts underway to standardise on libbpf as the canonical implementation. The plan is to ship libbpf with the kernel, which means it will set the de facto standard for user space BPF support.