
Improving Kubernetes Security

In summer 2018, the Gardener project team asked Kinvolk, in its role as a third-party contractor, to carry out several penetration tests. We applied our Kubernetes expertise to identify vulnerabilities in Gardener installations and to make recommendations.

Some of our findings are now presented in this article on the Gardener website.

We presented some of our findings in a joint presentation with SAP entitled Hardening Multi-Cloud Kubernetes Clusters as a Service at KubeCon 2018 in Shanghai. The slides in PDF and the video recording are now available.

We also presented it at the Gardener Bi-weekly Meeting; see the agenda for Friday, 7 Dec 2018.

If you need help with penetration testing your installation, please contact us at [email protected].

Exploring BPF ELF Loaders at the BPF Hackfest

Just before the All Systems Go! conference, we held a BPF Hackfest at the Kinvolk office, and one of the topics of discussion was documenting the different BPF ELF loaders. This blog post is the result.

BPF is a new technology in the Linux kernel that allows running custom code attached to kernel functions, network cards, or sockets, among other things. Since it is very versatile, a plethora of tools can be used to work with BPF code: perf record, tc from iproute2, libbcc, etc. Each of these tools has a different focus, but they all use the same Linux facilities to achieve their goals. This post documents the steps they use to load BPF into the kernel.

Common steps

BPF is usually compiled from C, using clang, and “linked” into a single ELF file. The exact format of the ELF file depends on the specific tool, but there are some common points. ELF sections are used to distinguish map definitions from executable code. Each code section usually contains a single, fully inlined function.

The loader creates maps from the definition in the ELF using the bpf(BPF_MAP_CREATE) syscall and saves the returned file descriptors [1]. This is where the first complication comes in, because the loader now has to rewrite all references to a particular map with the file descriptor returned by the bpf() syscall. It does this by iterating through the symbol and relocation tables contained in the ELF, which yields an offset into a code section. It then patches the instruction at that offset to use the correct fd [2].
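As a rough sketch of steps [1] and [2] (assuming the usual definitions from linux/bpf.h; error handling omitted), a loader might do something like this:

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Step [1]: create a map with the bpf() syscall and keep the fd. */
static int create_map(__u32 type, __u32 key_size, __u32 value_size,
                      __u32 max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* Step [2]: patch the BPF_LD_IMM64 instruction found at the relocation
 * offset so that it refers to the map fd. */
static void apply_map_relocation(struct bpf_insn *insn, int map_fd)
{
    insn->src_reg = BPF_PSEUDO_MAP_FD;
    insn->imm = map_fd;
}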

After this fixup is done, the loader uses bpf(BPF_PROG_LOAD) with the patched bytecode [3]. The BPF verifier resolves map fds to the in-kernel data structure, and verifies that the code is using the maps correctly. The kernel rejects the code if it references invalid file descriptors. This means that the outcome of BPF_PROG_LOAD depends on the environment of the calling process.
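Step [3] is a single bpf() call as well; a sketch under the same assumptions (kprobe programs additionally need attr.kern_version to be set):

/* Step [3]: load the patched bytecode; the verifier resolves the map
 * fds embedded in the instructions at this point. */
static int load_prog(__u32 prog_type, const struct bpf_insn *insns,
                     __u32 insn_cnt, const char *license)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = prog_type;
    attr.insns = (__u64)(unsigned long)insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (__u64)(unsigned long)license;

    return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}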

After the BPF program is successfully loaded, it can be attached to a variety of kernel subsystems [4]. Some subsystems use a simple setsockopt() call (e.g. SO_ATTACH_BPF for sockets), while others require netlink messages (XDP) or manipulating the tracefs (kprobes, tracepoints).
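For the socket case, for instance, step [4] boils down to a setsockopt() call (a sketch; SO_ATTACH_BPF exists since Linux 3.19):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <sys/socket.h>

/* Attach the program fd returned by BPF_PROG_LOAD to a raw socket. */
int attach_to_socket(int prog_fd)
{
    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    if (sock < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) < 0)
        return -1;
    return sock;
}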

Small differences between BPF ELF loaders

The different loaders offer different features and, for that reason, use slightly different conventions in the ELF file. These ELF conventions are not part of the Linux ABI, which means that an ELF file prepared for one loader usually cannot simply be loaded by another one. The map definition struct (struct bpf_elf_map in the schema) is the main varying part; iproute2's version is shown after the table below as an example.

BPF ELF loader \ Features | Maps in maps | Pinning | NUMA node | bpf2bpf function call
libbpf (Linux kernel) | no | no | yes (via samples) | yes
perf | no | no | no | yes
iproute2 / tc | yes | yes (none, object, global) | no | yes
gobpf | not yet | yes (none, object, global, custom) | no | no
newtools/ebpf | yes | no | no | yes
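For reference, this is the map definition struct used by iproute2/tc, as declared in iproute2's bpf_elf.h; the other loaders declare their own, incompatible variants:

/* iproute2's convention; field order and meaning vary between loaders. */
struct bpf_elf_map {
    __u32 type;
    __u32 size_key;
    __u32 size_value;
    __u32 max_elem;
    __u32 flags;
    __u32 id;
    __u32 pinning;
};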

There are other varying parts in loader ELF conventions that we found noteworthy:

  • Some use one ELF section per map, some use a single “maps” section for all the maps.
  • The naming of the sections and of the function entrypoints varies. Some loaders have default section names that can be overridden on the CLI (tc), some require well-defined prefixes (“kprobe/”, “kretprobe/”).
  • Some use CSV-style parameters in the section name (perf), some provide a Go API to programmatically change the loader’s behaviour.
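To illustrate these conventions, a program targeting an iproute2- or gobpf-style loader might be laid out as in the following sketch, reusing the struct bpf_elf_map shown earlier (the map and function names are illustrative):

#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <linux/types.h>

/* Loaders find code and maps by section name or prefix, declared
 * through an attribute macro like this one. */
#define SEC(name) __attribute__((section(name), used))

SEC("maps")
struct bpf_elf_map connection_count = {
    .type       = BPF_MAP_TYPE_HASH,
    .size_key   = sizeof(__u32),
    .size_value = sizeof(__u64),
    .max_elem   = 1024,
};

SEC("kprobe/tcp_v4_connect")
int trace_connect(struct pt_regs *ctx)
{
    /* program body */
    return 0;
}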

Conclusion

BPF is actively developed in the Linux kernel, and whenever a new feature is implemented, the BPF ELF loaders might need an update to support it. The different BPF ELF loaders have different focuses and might not add support for new BPF kernel features at the same speed. There are efforts underway to standardise on libbpf as the canonical implementation. The plan is to ship libbpf with the kernel, which means it will set the de facto standard for userspace BPF support.

Flatcar Linux is now open to the public

A few weeks ago we announced Flatcar Linux, our effort to create a commercially supported fork of CoreOS’ Container Linux. You can find the reasoning for the fork in our FAQ.

Since then we’ve been testing, improving our build process, establishing security procedures, and talking to testers about their experiences. We are now satisfied that Flatcar Linux is a stable and reliable container operating system that can be used in production clusters.

Open to the public

Thus, today we are ready to open Flatcar Linux to the public. Thanks to our testers for trying it out and providing feedback. We look forward to hearing from the wider community now that Flatcar Linux is more widely available.

For information about release and signing keys, please see the new Releases and the image signing key pages.

Filing issues or feature requests

You can use the Flatcar repository to file any issue or feature request you may have.

Flatcar Linux documentation

We are also happy to announce the initial release of our Flatcar Documentation. You can find information about installing and running Flatcar there.

Commercial support for Flatcar Linux

In the coming weeks we will be providing details of commercial support for Flatcar Linux. Please contact [email protected] if you are interested in commercial support.

Communication channels

We’ve created a mailing list and IRC channels to facilitate communications between users and developers of Flatcar Linux.

Please join those to talk about Flatcar Linux and discuss any issues or ideas you have. We look forward to hearing from you there!

Flatcar Linux @ KubeCon EU

The Kinvolk team will be on hand at KubeCon EU to discuss Flatcar Linux. Come by booth SU-C23 and say “Hi!”.

Thanks

Flatcar Linux would not exist without Container Linux. Thanks to the CoreOS team for building it; we look forward to continued cooperation with them.

Please follow Kinvolk and the Flatcar Linux project on Twitter to stay informed about commercial support and other Flatcar Linux updates in the coming weeks and months.

Towards unprivileged container builds

Once upon a time, software was built and installed with the classic triptych ./configure, make, make install. The build step with make didn’t need to be run as root; running it as root was, in fact, discouraged.

Later, software started being distributed through package managers and built with rpm or dpkg-buildpackage. Building packages as root was still unnecessary and discouraged. Since rpm or deb packages are just archive files, there shouldn’t be any need for privileged operations to build them. After all, we don’t need the ability to load a kernel module or reconfigure the network to create an archive file.

Why should we avoid building software as root? First, to avoid potential collateral damage to the developer’s machine. Second, to avoid being compromised by potentially untrusted resources. This is especially important for build services where anyone can submit a build job: the administrators of the build service have to protect their services against potentially malicious build submissions.

Nowadays, more and more software in cloud infrastructure is built and distributed as container images. Whether it is a Docker image, an OCI bundle, an ACI or another format, this is not so different from an archive file. And yet the majority of container images are built via a Dockerfile with the Docker Engine, which runs as root for most of its operations.

This makes life difficult for build services that want to offer container builds to users that are not necessarily trusted. How did we dig ourselves into this hole?

Why does docker build need root?

There are two reasons why docker build needs root: the “RUN” commands in a Dockerfile may require root inside the container, and setting up the build container itself requires privileged operations.

Run commands with privileges

Dockerfiles allow executing arbitrary commands, via the “RUN” command, inside the container environment being built. This makes builds very convenient: users can use “apt” in Ubuntu-based images to install additional packages, and they will be installed not on the host but in the container being built. This alone requires root access in the container, because “apt” needs to install files in directories that are only writable by root.

Starting the build container

To be able to execute those “RUN” commands in the container, “docker build” needs to start this build container first. To start any container, Docker needs to perform the following privileged operations, among others:

  • Preparing an overlay filesystem. This is necessary to keep track of the changes compared to the base image and requires CAP_SYS_ADMIN to mount.
  • Creating new Linux namespaces (sometimes called “unsharing”): mount namespace, pid namespace, etc. All of them (except one, as we will see below) require the CAP_SYS_ADMIN capability.
  • pivot_root or chroot, which also require CAP_SYS_ADMIN or CAP_SYS_CHROOT.
  • Mounting basic filesystems like /proc. The “RUN” command can execute arbitrary shell scripts, which often require a properly set up /proc.
  • Preparing basic device nodes like /dev/null, /dev/zero. This is also necessary for a lot of shell scripts. Depending on how they are prepared, this requires either CAP_MKNOD or CAP_SYS_ADMIN.

Only root can perform these operations:

Operation | Capability required | Without root?
Mount a new overlayfs | CAP_SYS_ADMIN | ❌
Create new (non-user) namespace | CAP_SYS_ADMIN | ❌
Chroot or pivot_root | CAP_SYS_CHROOT or CAP_SYS_ADMIN | ❌
Mount a new procfs | CAP_SYS_ADMIN | ❌
Prepare basic device nodes | CAP_MKNOD or CAP_SYS_ADMIN | ❌

This blog post will focus on some of those operations in detail. This is not an exhaustive list. For example, preparing basic device nodes is not covered in this blog post.

Projects similar to docker-build

There are other projects for building Docker containers that aim to be unprivileged. Some also aim to support builds from a Dockerfile.

  • img: Standalone, daemon-less, unprivileged Dockerfile and OCI compatible container image builder.
  • buildah: A tool that facilitates building OCI images
  • kaniko
  • orca-build

They could be a building block for CI services or serverless frameworks which need to build a container image for each function.

Where user namespaces come into play

In the same way that other Linux namespaces restrict the visibility of resources to processes inside the namespace, processes in user namespaces only see a subset of all possible users and groups. In the initial user namespace, there are 4294967296 (2^32) possible uids, ranging from 0, for the superuser or root, to 2^32 - 1.

uid mappings

When setting up a user namespace, container runtimes allocate a range of uids and specify a uid mapping, written to /proc/PID/uid_map. A mapping line like “0 100000 65536” means that uid 0 (root) in the container is mapped to uid 100000 on the host, uid 1 to uid 100001, and so on. Since root is relative to the user namespace, capabilities are also always relative to a specific user namespace. We will come back to that.

Nested user namespaces

User namespaces can be nested. The inner namespace will have the same number of uids as, or usually fewer than, the outer namespace. Not all uids from the outer namespace need to be mapped, but those that are mapped are mapped bijectively, one-to-one.

Unprivileged user namespaces

As opposed to all other kinds of Linux namespaces, user namespaces can be created by an unprivileged user (without CAP_SYS_ADMIN). In that case, the uid mapping is restricted to a single uid: for example, uid 1000 on the host mapped to root (uid 0) in the container.

Once the new unprivileged user namespace is created, the process inside it is root from the point of view of that namespace and therefore has CAP_SYS_ADMIN in it, so it can create the other kinds of namespaces.
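A minimal sketch of this building block (assuming a kernel with CONFIG_USER_NS=y; error handling mostly omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *content)
{
    int fd = open(path, O_WRONLY);

    write(fd, content, strlen(content));
    close(fd);
}

int main(void)
{
    char map[64];
    uid_t uid = getuid();

    /* Create an unprivileged user namespace: no capability needed. */
    if (unshare(CLONE_NEWUSER) < 0) {
        perror("unshare(CLONE_NEWUSER)");
        return 1;
    }

    /* Map our uid to root inside the namespace; a single-uid mapping
     * is the only one allowed without privileges. */
    snprintf(map, sizeof(map), "0 %u 1", (unsigned)uid);
    write_file("/proc/self/setgroups", "deny");
    write_file("/proc/self/uid_map", map);

    /* We are now root in this namespace and hold CAP_SYS_ADMIN in it,
     * so we can create the other kinds of namespaces. */
    if (unshare(CLONE_NEWNS | CLONE_NEWPID) < 0)
        perror("unshare(CLONE_NEWNS | CLONE_NEWPID)");
    return 0;
}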

This is a useful building block for our goal of unprivileged container builds.

Operation | Capability required | Without root?
Mount a new overlayfs | CAP_SYS_ADMIN | ❌
Create new user namespace | No capability required (*) | ✅
Create new (non-user) namespace | CAP_SYS_ADMIN | ✅ (inside a new user namespace)
Chroot or pivot_root | CAP_SYS_CHROOT or CAP_SYS_ADMIN | ❌
Mount a new procfs | CAP_SYS_ADMIN | ❌
Prepare basic device nodes | CAP_MKNOD or CAP_SYS_ADMIN | ❌

(*): No capability is required as long as all of the following are respected:

  • your kernel is built with CONFIG_USER_NS=y;
  • your Linux distribution does not add a distro-specific knob to restrict it (sysctl kernel.unprivileged_userns_clone on Arch Linux);
  • your uid mappings respect the restriction mentioned above;
  • seccomp is not blocking the unshare system call (as it could be in some Docker profiles).

Each Linux namespace is owned by a user namespace

Each Linux namespace instance, no matter the kind (mount, pid, etc.), has a user namespace owner: the user namespace in which the process that created it was sitting. When several kinds of Linux namespaces are created in a single syscall, the newly created user namespace owns the other newly created namespaces.

clone(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS);

The ownership of those namespaces is important because, for most operations, the kernel checks the owning user namespace when determining whether a process has the proper capability.

Consider a process that attempts to perform a pivot_root() syscall. To succeed, it needs CAP_SYS_ADMIN in the user namespace that owns the mount namespace where the process is located. In other words, having CAP_SYS_ADMIN in an unprivileged user namespace does not allow you to “escape” the container and get more privileges outside.

This is done in the function may_mount():

ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);

The function ns_capable() checks if the current process has the CAP_SYS_ADMIN capability within the user namespace that owns the mount namespace (mnt_ns) where the current process is located (current->nsproxy).

So, by creating the new mount namespace inside the unprivileged user namespace, we can do more. Let’s check our progress, and what we have achieved so far:

Operation | Capability required | Without root?
Mount a new overlayfs | CAP_SYS_ADMIN | ❌
Create new user namespace | No capability required (*) | ✅
Create new (non-user) namespace | CAP_SYS_ADMIN | ✅ (inside a new user namespace)
Chroot or pivot_root | CAP_SYS_CHROOT or CAP_SYS_ADMIN | ✅ (inside a new user namespace)
Mount a new procfs | CAP_SYS_ADMIN | ❌
Prepare basic device nodes | CAP_MKNOD or CAP_SYS_ADMIN | ❌

What about mounting the new overlayfs?

We’ve seen that pivot_root() can be done without privileges by creating a new mount namespace owned by a new unprivileged user namespace. Isn’t it the same for mounting the new overlayfs? After all, the mount() syscall is guarded by exactly the same ns_capable() call that we saw above for pivot_root(). Unfortunately, that’s not enough.

New mounts vs bind mounts

The mount system call can perform distinct actions:

  • New mounts: this mounts a filesystem that was not mounted before. A block device might be provided if the filesystem type requires one (ext4, vfat). Some filesystems don’t need a block device (FUSE, NFS, sysfs). But in any case, the kernel maintains a struct super_block to keep track of options such as read-only.

  • Bind mounts: a filesystem can be mounted on several mountpoints. A bind mount adds a new mountpoint from an existing mount. This will not create a new superblock but reuse it. The aforementioned “read-only” option can be set at the superblock level but also at the mountpoint level. In the example below, /mnt/data is bind-mounted on /mnt/foo so they share the same superblock. It can be achieved with:

    mount /dev/sdc /mnt/data		# new mount
    mount --bind /mnt/data /mnt/foo	# bind mount
    
  • Change options on an existing mount. These can be superblock options, per-mountpoint options, or propagation options (most useful when there are several mount namespaces).

Each superblock has a user namespace owner. Each mount has a mount namespace owner. To create a new bind mount, having CAP_SYS_ADMIN in the user namespace that owns the mount namespace where the process is located is normally enough (we’ll see some exceptions later). But creating a new mount in a non-initial user namespace is only allowed in some filesystem types. You can find the list in the Linux git repository with:

$ git grep -nw FS_USERNS_MOUNT

It is allowed for procfs, tmpfs, sysfs, cgroupfs and a few others. It is disallowed for ext4, NFS, FUSE, overlayfs and, in fact, most of them.

So mounting a new overlayfs without privileges for container builds seems impossible, at least with upstream Linux kernels. Ubuntu kernels have for some time allowed new mounts of overlayfs and FUSE in an unprivileged user namespace, by adding the FS_USERNS_MOUNT flag to those two filesystem types along with the necessary fixes.

Kinvolk worked with a client to contribute to the effort of upstreaming the FUSE part of those patches. Once everything is upstream, we will be able to mount overlayfs.

The FUSE mounts will be upstreamed first, before overlayfs. At that point, overlayfs could theoretically be re-implemented in userspace with a FUSE driver.

Operation | Capability required | Without root?
Mount a new overlayfs | CAP_SYS_ADMIN | ✅ (soon)
Create new user namespace | No capability required (*) | ✅
Create new (non-user) namespace | CAP_SYS_ADMIN | ✅ (inside a new user namespace)
Chroot or pivot_root | CAP_SYS_CHROOT or CAP_SYS_ADMIN | ✅ (inside a new user namespace)
Mount a new procfs | CAP_SYS_ADMIN | ❌
Prepare basic device nodes | CAP_MKNOD or CAP_SYS_ADMIN | ❌

What about procfs?

As noted above, procfs has the FS_USERNS_MOUNT flag, so it is possible to mount it in an unprivileged user namespace. Unfortunately, there are other restrictions that block us in practice in Docker or Kubernetes environments.

What are locked mounts?

To explain locked mounts, we’ll first have a look at systemd’s sandboxing features. systemd can run services in a separate mount namespace so that specific files and directories are read-only (ReadOnlyPaths=) or inaccessible (InaccessiblePaths=). The read-only part is implemented by bind-mounting the file or directory over itself and changing the mountpoint option to read-only. The inaccessible part is done by bind-mounting an empty file or an empty directory over the mountpoint, hiding what was there before.

Using bind mounts as a security measure to make files read-only or inaccessible is not unique to systemd: container runtimes do the same. This is only secure as long as the application cannot unmount that bind mount or move it away to see what was hidden under it. Both unmounting and moving a mount away (MS_MOVE) can be done with CAP_SYS_ADMIN, so the systemd documentation suggests not giving that capability to a service if such sandboxing features are to be effective. Similarly, Docker and rkt don’t give CAP_SYS_ADMIN by default.

We can imagine another way to circumvent bind mounts to see what’s under a mountpoint: using unprivileged user namespaces. Applications don’t need privileges to create a new mount namespace inside a new unprivileged user namespace, where they then have CAP_SYS_ADMIN. Once there, what prevents the application from removing the mountpoint with CAP_SYS_ADMIN? The answer is that the kernel detects such situations: mountpoints in a mount namespace owned by an unprivileged user namespace are marked as locked (flag MNT_LOCK) if they were created while cloning a mount namespace belonging to a more privileged user namespace. Such mounts cannot be unmounted or moved.

Let me describe what’s in this diagram:

  • On the left: the host mount namespace with a /home directory for Alice and Bob.

  • In the middle: a mount namespace for a systemd service that was started with the option “ProtectHome=yes”. /home is masked by a mount, hiding the alice and bob subdirectories.

  • On the right: a mount namespace created by the aforementioned systemd service, inside an unprivileged user namespace, attempting to unmount /home in order to see what’s under it. But /home is a locked mount, so it cannot be unmounted there.
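The scenario on the right can be sketched as follows (assuming the program runs in a mount namespace where /home is masked by such a bind mount):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* New user + mount namespaces: we now hold CAP_SYS_ADMIN over the
     * new mount namespace... */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0) {
        perror("unshare");
        return 1;
    }

    /* ...but mounts inherited from the more privileged parent mount
     * namespace carry MNT_LOCK, so removing the mask fails anyway. */
    if (umount("/home") < 0)
        perror("umount(/home)");
    return 0;
}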

The exception of procfs and sysfs

The explanation about locked mounts is valid for all filesystems, including procfs and sysfs, but that’s not the full story. Indeed, in the build container we normally don’t bind-mount procfs but create a new mount, because we are inside a new pid namespace and want a new procfs that reflects that.

New mounts are normally independent from each other, so a masked path in a mount would not prevent another new mount: if /home is mounted from /dev/sdb and has masked paths, it should not influence /var/www mounted from /dev/sdc in any way.

But procfs and sysfs are different: some files there are singletons. For example, the file /proc/kcore refers to the same kernel object even if it is accessed from different mounts. Docker masks the following files in /proc:

$ sudo docker run -ti --rm busybox mount | grep /proc/
proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

The capability normally needed to circumvent the restriction on those files is CAP_SYS_ADMIN (for umount, for example). To prevent a process without CAP_SYS_ADMIN from accessing those masked files by mounting a new procfs inside a new unprivileged user namespace and a new mount namespace, the kernel uses the function mount_too_revealing() to check that procfs is already fully visible. If it is not, the new procfs mount is denied.
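The effect can be sketched as follows (assuming the program runs inside such a container, where the parent procfs has masked paths):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Fresh user + mount namespaces: mount() passes the usual
     * capability check... */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0) {
        perror("unshare");
        return 1;
    }

    /* ...but mount_too_revealing() rejects a new procfs mount because
     * the procfs we inherited is not fully visible. */
    if (mount("proc", "/proc", "proc", 0, NULL) < 0)
        perror("mount(proc)");
    return 0;
}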

Mount type | Protected by | Protection applies to filesystem types
Bind mounts | locked mounts (MNT_LOCK) | all
New mounts | mount_too_revealing() | procfs and sysfs

This is blocking us from mounting procfs from within a Kubernetes pod.

Several workarounds are possible:

  • Avoid mounting procfs in the build environment and update Dockerfiles that depend on it.
  • Using a Kubernetes container with privileges, so that /proc in the Docker container is not covered. A “rawproc” option in Kubernetes is being discussed with the underlying implementation in moby.
  • Changing the kernel to allow a new procfs mount in an unprivileged user namespace, even when the parent proc mount is not fully visible, but with the same masks applied in the child proc mount. I started this discussion in an RFC patch, and there is an alternative proposal by Djalal Harouni to fix procfs more generally.

Conclusion

As you can see, there are a lot of moving parts, as is generally the case with Linux containers. But this is an area where development is quite active at the moment, and hope for progress is greater than it has ever been. This blog post explored some aspects of the underlying mechanisms on Linux that are being worked on for unprivileged container builds: user namespaces, mounts, and some filesystems. We hope to bring you updates about unprivileged container builds in the future, and especially about our own involvement in these efforts.

Kinvolk’s offerings

Kinvolk is an engineering team based in Berlin working on Linux, Containers and Kubernetes. We combine our expertise in low-level Linux details like capabilities, user namespaces and FUSE with our Kubernetes expertise to offer specialised services for your infrastructure that go all the way down the stack. Contact us at [email protected] to learn more about what Kinvolk does.

Announcing the Flatcar Linux project

Today Kinvolk announces Flatcar Linux, an immutable Linux distribution for containers. With this announcement, Kinvolk is opening the Flatcar Linux project to early testers. If you are interested in becoming a tester and willing to provide feedback, please let us know.

Flatcar Linux is a friendly fork of CoreOS’ Container Linux and as such, compatible with it. It is independently built, distributed and supported by the Kinvolk team.

Why fork Container Linux?

At Kinvolk, we provide support and engineering services for foundational open-source Linux projects used in cloud infrastructure. Last year we started getting inquiries about providing support for Container Linux, and since then we have been thinking about how we could offer such support.

When we are asked to provide support for projects that we do not maintain (a common occurrence), the process is rather simple. We work with the upstream maintainers to evaluate whether a change would be acceptable and attempt to get that work into the upstream project. If a change is not acceptable to the upstream project and a client needs it, we can create a patch set that we maintain ourselves and provide our own release builds. Thus, it is straightforward to provide commercial support for upstream projects.

Providing commercial support for a Linux distribution is more difficult and cannot be done without full control over the means of building, signing and delivering the operating system images and updates. Thus, our conclusion was that forking the project would be required.

Why now?

With the announcement of Red Hat’s acquisition of CoreOS, many in the cloud native community quickly asked, “What is going to happen to Container Linux?” We were pleased when Rob announced Red Hat’s commitment to maintaining Container Linux as a community project. But these events bring up two issues that Flatcar Linux aims to address.

The strongest open source projects have multiple commercial vendors that collaborate in a mutually beneficial relationship. This increases the bus factor of a project. Container Linux has a bus factor of 1; the introduction of Flatcar Linux brings that to 2.

While we are hopeful that Red Hat is committed to maintaining Container Linux as an open source project, we feel that it is important that open source projects, especially those that are at the core of your system, have strong commercial support.

Road to general availability

Over the next month or so, we will be going through a testing phase. We will focus on responding to feedback that we receive from testers. We will also concentrate on improving processes and our build and delivery pipeline. Once the team is satisfied that the release images are working well and that we can reliably deliver images and updates, we will make the project generally available. To receive a notification when this happens, sign up for project updates.

How can I help?

We are looking for help testing builds and providing feedback. Let us know if you’d be able to test images here.

We are also looking for vendors that could donate caching, hosting and other infrastructure services to the project. You can contact us about this at [email protected].

More information

For more information, please see the project FAQ.

Follow Flatcar Linux and Kinvolk on Twitter to get updates about the Flatcar Linux project.

Kinvolk is now a Kubernetes Certified Service Provider

The Kinvolk team is proud to announce that we are now a Kubernetes Certified Service Provider. We join an esteemed group of organizations that provide valuable services to the Kubernetes community.

Kubernetes Certified Service Providers are vetted service companies that have at least 3 Certified Kubernetes Administrators on staff, have a track record of providing development and operation services to companies, and have that work used in production.

At Kinvolk, we have collaborated with leading companies in the cloud-native community to help build cloud infrastructure technologies that integrate optimally with Linux. Companies come to Kinvolk because of our unique mix of core Linux knowledge combined with well-documented experience in applying that knowledge to modern cloud infrastructure projects. We look forward to continuing such collaborations with more partners in the Kubernetes community.

To learn more about how our team can help you build or improve your products, or the open source projects you rely on, contact us at [email protected].

Follow us on Twitter to get updates on what Kinvolk is up to.

Timing issues when using BPF with virtual CPUs

Introduction

After implementing TCP connection tracking using eBPF in Weave Scope (see our post on the Weaveworks blog), we faced an interesting bug that happened only in virtualized environments like AWS, not on bare metal. The events retrieved via eBPF seemed to be received in the wrong chronological order. We are going to use this bug as an opportunity to discuss some interesting aspects of BPF and virtual CPUs (vCPUs).

Background

Let’s describe in more detail the scenario and provide some background on Linux clocks.

Why is chronological order important for Scope?

Scope provides a visualization of network connections in distributed systems. To do this, Scope needs to maintain a list of current TCP connections. It does so by receiving TCP events from the kernel via the eBPF program we wrote, tcptracer-bpf. Scope receives TCP connect, accept, and close events and updates its internal state accordingly.

If events were received in the wrong order (a TCP close before a TCP connect), Scope would not be able to make sense of them: the TCP close received first would not match any existing connection that Scope knows of, and the TCP connect received second would add a connection to Scope’s internal state that would never be removed.

TCP events sent from kernel space to userspace

How are events transferred from the kernel to the Scope process?

Context switches and kernel/userspace transitions can be slow, and we need an efficient way to transfer a large number of events. This is achieved using a perf ring buffer. A ring buffer, or circular buffer, is a data structure that allows a writer to send events to a reader asynchronously. The perf subsystem in the Linux kernel has a ring buffer implementation that allows a writer in the kernel to send events to a reader in userspace. It works without any expensive locking mechanism by using well-placed memory barriers.

On the kernel side, the BPF program writes an event into the ring buffer with the BPF helper function bpf_perf_event_output(), introduced in Linux 4.4. On the userspace side, we can read the events either from an mmapped memory region (fast) or from the file descriptor with the read() system call (slower). Scope uses the fast method.

However, as soon as the computer has more than one CPU, several TCP events could happen simultaneously, for example one per CPU. This means there could be several writers at the same time, so we would not be able to use a single ring buffer for everything. The solution is simple: use a different ring buffer for each CPU. On the kernel side, each CPU writes into its own ring buffer, and the userspace process reads sequentially from all the ring buffers.

TCP events traveling through ring buffers.

Multiple ring buffers introduce out-of-order events

Each ring buffer is normally ordered chronologically, as expected, because each CPU writes its events sequentially into its ring buffer. But on a busy system, there could be several events pending in each ring buffer. When the user-space process picks up the events, it does not at first know whether an event from ring buffer cpu#0 happened before or after an event from ring buffer cpu#1.

Adding timestamps for sorting events

Fortunately, BPF has a simple way to address this: the BPF helper function bpf_ktime_get_ns(), introduced in Linux 4.1, gives us a timestamp in nanoseconds. The TCP event written to the ring buffer is a struct, so we simply added a timestamp field to it. When the userspace program receives events from different ring buffers, it sorts them by timestamp.

The BPF program (in yellow) executed by a CPU calls two BPF helper functions: bpf_ktime_get_ns() and bpf_perf_event_output()
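On the BPF side, the change amounts to something like the following sketch, simplified from what tcptracer-bpf does (the events map, the SEC() macro and the exact event layout are illustrative):

/* Simplified event struct sent through the perf ring buffer; the
 * timestamp field is the addition described above. */
struct tcp_event {
    __u64 timestamp;   /* from bpf_ktime_get_ns() */
    __u32 pid;
    /* ... addresses, ports, event type ... */
};

SEC("kprobe/tcp_close")
int trace_tcp_close(struct pt_regs *ctx)
{
    struct tcp_event evt = {};

    evt.timestamp = bpf_ktime_get_ns();
    /* ... fill in the connection details ... */

    /* Write the event, timestamp included, to this CPU's ring buffer. */
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                          &evt, sizeof(evt));
    return 0;
}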

Sorting and synchronization

Sorting is actually not that simple, because we don’t just have a static set of events to sort. Instead, we have a dynamic system in which several event sources continuously hand the process new events. As a result, when sorting the events received at some point in time, we could receive a new event that has to be placed before the events we are currently sorting. It is like sorting a set without knowing the complete set of items to sort.

To solve this problem, Scope needs a means of synchronization. Before we start gathering and sorting events, we measure the time with clock_gettime(). Then we read events from all the ring buffers, but stop processing a ring buffer if it is empty or if it gives us an event with a timestamp later than the clock_gettime() measurement. This way, we only sort events emitted before the beginning of the collection round; newer events are sorted in the next iteration.
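Scope’s actual implementation is in Go, but one collection round can be sketched in C as follows (ring_peek, ring_pop, collect and sort_collected_by_timestamp are hypothetical helpers):

#include <stdint.h>
#include <time.h>

struct tcp_event { uint64_t timestamp; /* ... */ };

/* Hypothetical helpers standing in for the real implementation. */
extern struct tcp_event *ring_peek(int cpu);
extern struct tcp_event *ring_pop(int cpu);
extern void collect(struct tcp_event *evt);
extern void sort_collected_by_timestamp(void);

void collection_round(int ncpus)
{
    struct timespec ts;
    uint64_t deadline;

    /* Take the deadline first: this round only sorts events emitted
     * before the collection started. */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    deadline = (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        struct tcp_event *evt;

        /* Stop at an empty ring or at an event newer than the
         * deadline; such events are handled in the next round. */
        while ((evt = ring_peek(cpu)) != NULL) {
            if (evt->timestamp > deadline)
                break;
            collect(ring_pop(cpu));
        }
    }

    sort_collected_by_timestamp();
}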

A word on different clocks

Linux has several clocks, as you can see in the clock_gettime() man page. We need to use a monotonic clock; otherwise, timestamps from different events cannot be compared meaningfully. Non-monotonicity can come from clock updates from NTP, updates from other software (such as the Google clock skew daemon), timezones, leap seconds, and other phenomena.

Just as importantly, we need to use the same clock for the events (measured with the BPF helper function bpf_ktime_get_ns()) and in the userspace process (measured with the clock_gettime() system call), since we compare the two. Fortunately, the BPF helper function gives us the equivalent of CLOCK_MONOTONIC.

Bugs in the Linux kernel can make the timestamp wrong. For example, a bug was introduced in 4.8 and was backported to older kernels by distros; the fix was included in 4.9 and also backported. In Ubuntu, for instance, the bug was introduced in kernel 4.4.0-42 and was not fixed until kernel 4.4.0-51.

The problem with vCPUs

The above scheme requires strictly reliable timing. But vCPUs don’t make this straightforward.

Events are still unordered sometimes

Despite implementing all of this, we still sometimes noticed that events were ordered incorrectly. It happened rarely, perhaps once every few days, and only on EC2 instances, not on bare metal. What explains the difference in behaviour between virtualized environments and bare metal?

To understand the difference, we’ll need to take a closer look at the source code. Scope uses the library tcptracer-bpf to load the BPF programs. The BPF programs are actually quite complex because they need to handle different cases: IPv4 vs IPv6, the asynchronous nature of TCP connect, and the difficulty of passing contexts between BPF functions. But for the purposes of this race, we can simplify them to two function calls: bpf_ktime_get_ns() to measure the time, and bpf_perf_event_output() to write the event, including the timestamp, to the ring buffer.

The way the code was written assumed that the time between those two calls was negligible, or at least constant. But in virtualized environments, virtual CPUs (vCPUs) can sleep at arbitrary points, even in the middle of BPF execution in the kernel, depending on the hypervisor’s scheduling. So the time a BPF program takes to complete can vary from one execution to another.

Consider the following diagram:

Two CPUs executing the same BPF function concurrently

With a vCPU, we have no guarantees with respect to how long a BPF program will take between the two function calls; we’ve seen up to 98ms. This means that the userspace program has no guarantee that it will have received all the events before a specific timestamp.

In effect, this means we cannot rely on absolute timing consistency in virtualized environments. This, unfortunately, means implementers must take such a scenario into consideration.

Possible fixes

Any solution would have to ensure that the user-space Scope process waits long enough to have received the events from the different queues up to a specific time. One suggested solution was to regularly generate synchronization events on each CPU and deliver them along the same path through the ring buffers. This would ensure that no CPU sleeps for a long time without handling events.

But due to the difficulty of implementation and the rarity of the issue, we implemented a workaround by just detecting when the problem happens and restarting the BPF engine in tcptracer-bpf.

Conclusion

Investigating this bug and writing workaround patches for it led us to write a reproducer using CPU affinity primitives (taskset) and to explore several complex aspects of Linux systems: virtual CPUs in hypervisors, clocks, ring buffers, and of course eBPF.

We’d be interested to hear from others who have encountered such issues with vCPUs and especially those who have additional insight or other ideas for proper fixes.


Kinvolk is available for hire for Linux- and Kubernetes-based projects

Follow us on Twitter to get updates on what Kinvolk is up to.

Join the Kinvolk team at FOSDEM 2018!

FOSDEM, the premier European open source event that takes place in Brussels, is right around the corner! Most of the Kinvolk team is heading there for a collaborative weekend, with three of our engineers giving talks.

Kinvolk Talk Schedule

Sunday, February 4, 2018

  • 10:00 - 10:25: Zeeshan Ali, “Rust memory management”
    Zeeshan, software engineer at Kinvolk, will give a quick introduction to the memory management concepts of Rust, a systems programming language that focuses on safety and performance simultaneously.

  • 11:30 - 11:50: Iago López Galeiras, “State of the rkt container runtime and its Kubernetes integration”
    Iago, technical lead & co-founder at Kinvolk, will dive into the rkt container runtime and its Kubernetes integration, specifically looking at the progress of rkt and of rktlet, the Kubernetes CRI implementation for rkt.

  • 15:05 - 15:25: Alban Crequy, “Exploring container image distribution with casync”
    Alban, CTO and co-founder at Kinvolk, will explore container image distribution with casync, a content-addressable data synchronization tool.

We’re looking forward to seeing old friends and making new ones.

Follow us on Twitter to see what we are up to at the conference!

Automated Build to Kubernetes with Habitat Builder

Introduction

Imagine a set of tools that allows you not only to build your codebase automatically each time you apply new changes, but also to deploy it to a cluster for testing, provided the build is successful. Once the smoke tests pass or the QA team gives the go-ahead, the same artifact can be automatically deployed to production.

In this blog post we talk about such an experimental pipeline that we’ve built using Habitat Builder and Kubernetes. But first, let’s look at the building blocks.

What is Habitat and Habitat Builder?

Habitat is a tool by Chef that allows one to automate the deployment of applications. It allows developers to package their application for multiple environments like a container runtime or a VM.

One of Habitat’s components is Builder. It uses a plan.sh file, which is part of the application codebase, to build a Habitat package. A plan.sh file is to Habitat what a Dockerfile is to Docker and, like Docker, the build outputs an artifact, in this case a Habitat artifact with a .hart extension.

Habitat also has a concept called channels which are similar to tags. By default, a successful build is tagged under the unstable channel and users can use the concept of promotion to promote a specific build of a package to a different channel like stable, staging or production. Users can choose channel names for themselves and use the hab pkg promote command to promote a package to a specific channel.

Please check out the tutorials on the Habitat site for a more in-depth introduction to Habitat.

Habitat ❤ Kubernetes

Kubernetes is a platform that runs containerized applications and supports container scheduling, orchestration, and service discovery. Thus, while Kubernetes does the infrastructure management, Habitat manages the application packaging and deployment.

We will take a look at the available tools that help us integrate Habitat in a functioning Kubernetes cluster.

Habitat Operator

A Kubernetes Operator is an abstraction that takes care of running a more complex piece of software. It leverages the Kubernetes API, and manages and configures the application by hiding the complexities away from the end user. This allows a user to be able to focus on using the application for their purposes instead of dealing with deployment and configuration themselves. The Kinvolk team built a Habitat Operator with exactly these goals in mind.

Habitat Kubernetes Exporter

Recently, a new exporter was added to Habitat by the Kinvolk team that helps in integrating Habitat with Kubernetes. It creates and uploads a Docker image to a Docker registry, and returns a manifest that can be applied with kubectl. The output manifest file can be specified as a command line argument and it also accepts a custom Docker registry URL. This blog post covers this topic in more depth along with a demo at the end.

Automating Kubernetes Export

Today we are excited to show you a demo of a fully automated Habitat Builder to Kubernetes pipeline that we are currently working on together with the Habitat folks:

The video shows a private Habitat Builder instance re-building the it-works project, exporting a Docker image to Docker Hub, and automatically deploying it to a local Kubernetes cluster through the Habitat operator. Last but not least, the service is automatically promoted from unstable to testing.

In the future, Kubernetes integration will allow you to set up not only seamless, automated deploys but also bring Habitat’s service promotion to services running in Kubernetes. Stay tuned!

If you want to follow our work (or set up the prototype yourself), you can find a detailed README here.

Conclusion

This is an exciting start to how both Habitat and Kubernetes can complement each other. If you are at KubeCon, stop by at the Habitat or Kinvolk booth to chat about Habitat and Kubernetes. You can also find us on the Habitat slack in the #general or #kubernetes channels.

Get started with Habitat on Kubernetes

Habitat is a project that aims to solve the problem of building, deploying and managing services. We at Kinvolk have been working on Kubernetes integration for Habitat in cooperation with Chef. This integration comes in the form of a Kubernetes controller called Habitat operator. The Habitat operator allows cluster administrators to fully utilize Habitat features inside their Kubernetes clusters, all the while maintaining high compatibility with the “Kubernetes way” of doing things. For more details about Habitat and the Habitat operator have a look at our introductory blog post.

In this guide we will explain how to use the Habitat operator to run and manage a Habitat-packaged application in a Kubernetes cluster on Google Kubernetes Engine (GKE). This guide assumes a basic understanding of Kubernetes.

We will deploy a simple web application which displays the number of times the page has been accessed.

Prerequisites

We’re going to assume some initial setup is done. For example, you’ll need to have created an account on Google Cloud Platform and have already installed and configured the Google Cloud SDK as well as its beta component. Lastly, you’ll want to download kubectl so you can connect to the cluster.

Creating a cluster

To start, we’ll want to create a project on GCP to contain the cluster and all related settings. Project names are unique on GCP, so use one of your choosing in the following commands.

Create it with:

$ gcloud projects create habitat-on-kubernetes

We will then need to enable the “compute API” for the project we’ve just created. This API allows us to create clusters and containers.

$ gcloud service-management enable container.googleapis.com --project habitat-on-kubernetes

We also need to enable billing for our project, since we’re going to spin up some nodes in a cluster:

$ gcloud beta billing projects link habitat-on-kubernetes --billing-account=$your-billing-id

Now we’re ready to create the cluster. We will have to choose a name and a zone in which the cluster will reside. You can list existing zones with:

$ gcloud compute zones list --project habitat-on-kubernetes

The following command sets the zone to “europe-west1-b” and the name to “habitat-demo-cluster”. It can take several minutes to complete.

$ gcloud container clusters create habitat-demo-cluster --project habitat-on-kubernetes --zone europe-west1-b

Deploying the operator

The next step is to deploy the Habitat operator. This is a component that runs in your cluster and reacts to the creation and deletion of Habitat custom objects by creating, updating or deleting resources in the cluster. Like all objects, operators are deployed with a YAML manifest file. The contents of the manifest file are shown below:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: habitat-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: habitat-operator
    spec:
      containers:
      - name: habitat-operator
        image: kinvolk/habitat-operator:v0.2.0

From the root of our demo application, we can then deploy the operator with:

kubectl create -f kubernetes/habitat-operator.yml

Deploying the demo application

With that done, we can finally deploy our demo application:

apiVersion: habitat.sh/v1
kind: Habitat
metadata:
  name: habitat-demo-counter
spec:
  image: kinvolk/habitat-demo-counter
  count: 1
  service:
    topology: standalone
---
apiVersion: v1
kind: Service
metadata:
  name: front
spec:
  selector:
    habitat-name: habitat-demo-counter
  type: LoadBalancer
  ports:
  - name: web
    targetPort: 8000
    port: 8000
    protocol: TCP

Just run the following command:

$ kubectl create -f kubernetes/habitat-demo-counter.yml

We can monitor the status of our deployment with kubectl get pod -w. Once all pods are in the “Running” state, our application is fully deployed and ready to interact with.

Let’s find out the public IP address of our application by running kubectl get services front. The IP will be listed under the column “External IP”.

Let’s test it out by going to the service’s IP and port 8000, where we should see the app’s landing page, with the view counter. The counter increases every time we refresh the page, and can be reset with the “Reset” button.

To see this in action, watch the video below.

The Ruby web application has been packaged with Habitat, and is now running as a Habitat service in a Docker container deployed on Kubernetes. Congratulations!

What's new in kube-spawn

There’s been a number of changes in kube-spawn since we announced it.

The main focus of the recent developments was improving the CLI, supporting several clusters running in parallel, and enabling developers to test Kubernetes patches easily. In addition, we’ve added a bunch of documentation, improved error messages and, of course, fixed a lot of bugs.

CLI redesign

We’ve completely redesigned the CLI commands used to interact with kube-spawn. You can now use create to generate the cluster environment, and then start to boot and provision the cluster. The convenience up command does the two steps in one so you can quickly get a cluster with only one command.

Once a cluster is up and running you can use stop to stop it and keep it there to start it again later, or restart to stop and start the cluster.

The command destroy will take a cluster in the stopped or running state and remove it completely, including any disk space the cluster was using.

The following diagram provides a visualization of the CLI workflow.

Multi-cluster support

Previously, users could only run one cluster at a time. With the --cluster-name flag, running multiple clusters in parallel is now possible.

All the CLI operations can take --cluster-name to specify which cluster you’re referring to. To see your currently created clusters, a new command list was added to kube-spawn.

This is especially useful when you want to test how your app behaves in different Kubernetes versions or, as a Kubernetes developer, when you made a change to Kubernetes itself and want to compare a cluster without changes and another with your change side-by-side. Which leads us to the next feature.

Dev workflow support

kube-spawn makes testing changes to Kubernetes really easy. You just need to build your Hyperkube Docker image with a particular VERSION tag. Once that’s built, you need to start kube-spawn with the --dev flag, and set --hyperkube-tag to the same name you used when building the Hyperkube image.

Taking advantage of the aforementioned multi-cluster support, you can build current Kubernetes master and start a cluster with --cluster-name=master, then build Kubernetes with your patch and start another cluster with --cluster-name=fix. You’ll now have two clusters to check how your patch behaves in comparison with an unpatched Kubernetes.

You can find a detailed step-by-step example of this in kube-spawn’s documentation.

kube-spawn, a certified Kubernetes distribution


We’ve successfully run the Kubernetes Software Conformance Certification tests based on Sonobuoy for Kubernetes v1.7 and v1.8. We’ve submitted the results to CNCF and they merged our PRs. This means kube-spawn is now a certified Kubernetes distribution.

Conclusion

With the above additions, we feel that kube-spawn is one of the best tools on Linux for developing with, and on, Kubernetes.

If you want to try it out, we’ve just released kube-spawn v0.2.1. We look forward to your feedback and welcome issues or PRs on the GitHub project.

Introducing the Habitat Kubernetes Exporter

At Kinvolk, we’ve been working with the Habitat team at Chef to make Habitat-packaged applications run well in Kubernetes.

The first step on this journey was the Habitat operator for Kubernetes which my colleague, Lili, already wrote about. The second part of this project —the focus of this post— is to make it easier to deploy Habitat apps to a Kubernetes cluster that is running the Habitat operator.

Exporting to Kubernetes

To that end, we’d like to introduce the Habitat Kubernetes exporter.

The Kubernetes exporter is an additional command line subcommand to the standard Habitat CLI interface. It leverages the existing Docker image export functionality and, additionally, generates a Kubernetes manifest that can be deployed to a Kubernetes cluster running the Habitat operator.

The command line for the Kubernetes exporter is:

$ hab pkg export kubernetes ORIGIN/NAME

Run hab pkg export kubernetes --help to see the full list of available options and general help.

Demo

Let’s take a look at the Habitat Kubernetes exporter in action.

As you can see, the Habitat Kubernetes exporter helps you to deploy your applications that are built and packaged with Habitat on a Kubernetes cluster by generating the needed manifest files.

More to come

We’ve got more exciting ideas for making Habitat and Habitat Builder work even more seamlessly with Kubernetes. So stay tuned for more.

Kubernetes The Hab Way

What does a Kubernetes setup done the Hab(itat) way look like? In this blog post we will explore how to use Habitat’s application automation to set up and run a Kubernetes cluster from scratch, based on the well-known “Kubernetes The Hard Way” manual by Kelsey Hightower.

A detailed README with step-by-step instructions and a Vagrant environment can be found in the Kubernetes The Hab Way repository.

Kubernetes Core Components

To recap, let’s have a brief look at the building blocks of a Kubernetes cluster and their purpose:

  • etcd, the distributed key value store used by Kubernetes for persistent storage,
  • the API server, the API frontend to the cluster’s shared state,
  • the controller manager, responsible for ensuring the cluster reflects the configured state,
  • the scheduler, responsible for distributing workloads on the cluster,
  • the network proxy, for service network configuration on cluster nodes and
  • the kubelet, the “primary node agent”.

For each of the components above, you can now find core packages on bldr.habitat.sh. Alternatively, you can fork the upstream core plans or build your own packages from scratch. Habitat studio makes this process easy.

By packaging Kubernetes components with Habitat, we can use Habitat’s application delivery pipeline and service automation, and benefit from them as with any other Habitat-managed application.

Deploying services

Deployment of all services follows the same pattern: first loading the service and then applying custom configuration. Let’s have a look at the setup of etcd to understand how this works in detail:

To load the service with default configuration, we use the hab sup subcommand:

$ sudo hab sup load core/etcd --topology leader

Then we apply custom configuration. For Kubernetes we want to use client and peer certificate authentication instead of autogenerated SSL certificates. We have to upload the certificate files and change the corresponding etcd configuration parameters:

$ for f in /vagrant/certificates/{etcd.pem,etcd-key.pem,ca.pem}; do sudo hab file upload etcd.default 1 "${f}"; done

$ cat /vagrant/config/svc-etcd.toml
etcd-auto-tls = "false"
etcd-http-proto = "https"

etcd-client-cert-auth = "true"
etcd-cert-file = "files/etcd.pem"
etcd-key-file = "files/etcd-key.pem"
etcd-trusted-ca-file = "files/ca.pem"

etcd-peer-client-cert-auth = "true"
etcd-peer-cert-file = "files/etcd.pem"
etcd-peer-key-file = "files/etcd-key.pem"
etcd-peer-trusted-ca-file = "files/ca.pem"

$ sudo hab config apply etcd.default 1 /vagrant/config/svc-etcd.toml

Since service configuration with Habitat is per service group, we don’t have to do this for each member instance of etcd. The Habitat supervisor will distribute the configuration and files to all instances and reload the service automatically.

If you follow the step-by-step setup on GitHub, you will notice that the same pattern applies to all components.

Per-instance configuration

Sometimes, though, each instance of a service requires custom configuration parameters or files. With Habitat, all configuration is shared within the service group, and it is not possible to provide configuration to a single instance only. In such cases we have to fall back to “traditional infrastructure provisioning”. Also, uploaded files are limited to 4096 bytes, which is sometimes not enough.

In the Kubernetes setup, each kubelet needs a custom kubeconfig, CNI configuration, and a personal node certificate. For this we create a directory (/var/lib/kubelet-config/) and place the files there before loading the service. The Habitat service configuration then points to files in that directory:

$ cat config/svc-kubelet.toml
kubeconfig = "/var/lib/kubelet-config/kubeconfig"

client-ca-file = "/var/lib/kubelet-config/ca.pem"

tls-cert-file = "/var/lib/kubelet-config/node.pem"
tls-private-key-file = "/var/lib/kubelet-config/node-key.pem"

cni-conf-dir = "/var/lib/kubelet-config/cni/"

Automatic service updates

If desired, Habitat services can be automatically updated by the supervisor once a new version is published on a channel, by loading the service with --strategy set to either at-once or rolling. By default, automatic updates are disabled. With this, Kubernetes components can be made self-updating, an interesting topic that could be explored in the future.

Conclusion

We have demonstrated how Habitat can be used to build, set up, and run Kubernetes cluster components in the same way as any other application.

If you are interested in using Habitat to manage your Kubernetes cluster, keep an eye on this blog and the “Kubernetes The Hab Way” repo for future updates and improvements. Also, have a look at the Habitat operator, a Kubernetes operator for Habitat services, that allows you to run Habitat services on Kubernetes.

Announcing the Initial Release of rktlet, the rkt CRI Implementation

We are happy to announce the initial release of rktlet, the rkt implementation of the Kubernetes Container Runtime Interface. This is a preview release, and is not meant for production workloads.

When using rktlet, all container workloads are run with the rkt container runtime.

About rkt

The rkt container runtime is unique amongst container runtimes in that, once rkt is finished setting up the pod and starting the application, no rkt code is left running. rkt also takes a security-first approach, not allowing insecure functionality unless the user explicitly disables security features. And rkt is pod-native, matching ideally with the Kubernetes concept of pods. In addition, rkt prefers to integrate and drive improvements into existing tools, rather than reinvent things. And lastly, rkt allows for running apps in various isolation environments — container, VM or host/none.

rkt support in Kubernetes

With this initial release of rktlet, rkt currently has two Kubernetes implementations. Original rkt support for Kubernetes was introduced in Kubernetes version 1.3. That implementation — which goes by the name rktnetes — resides in the core of Kubernetes. Just as rkt itself kickstarted the drive towards standards in containers, this original rkt integration also spurred the introduction of a standard interface within Kubernetes to enable adding support for other container runtimes. This interface is known as the Kubernetes Container Runtime Interface (CRI).

With the Kubernetes CRI, container runtimes have a clear path towards integrating with Kubernetes. rktlet is the rkt implementation of that interface.

Project goals

The goal is to make rktlet the preferred means to run workloads with rkt in Kubernetes. But companies like Blablacar rely on the Kubernetes-internal implementation of rkt to run their infrastructure. Thus, we cannot just remove that implementation without having a viable alternative.

rktlet currently passes 129 of the 145 Kubernetes end-to-end conformance tests. We aim for full compliance. Later in this article, we’ll look at what work remains to get there.

Once rktlet is ready, the plan is to deprecate the rkt implementation in the core of Kubernetes.

How rktlet works

rktlet is a daemon that communicates with the kubelet via gRPC. The CRI is the interface by which kubelet and rktlet communicate. The main CRI methods are

  • RunPodSandbox(),
  • PodSandboxStatus(),
  • CreateContainer(),
  • StartContainer(),
  • StopPodSandbox(),
  • ListContainers(),
  • etc.

These methods handle lifecycle management and gather state.
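
For orientation, here is a rough sketch of that method subset as a Go interface. This is illustrative only: the real interface is generated from the Kubernetes CRI protobuf definitions, and the placeholder types below merely stand in for the actual CRI messages.

package cri

// Placeholder types standing in for the CRI protobuf messages.
type PodSandboxConfig struct{ Name, Namespace string }
type PodSandboxStatus struct{ State string }
type ContainerConfig struct{ Name, Image string }
type ContainerFilter struct{ PodSandboxID string }
type Container struct{ ID, State string }

// RuntimeService sketches the subset of CRI methods listed above.
type RuntimeService interface {
        // Pod sandbox lifecycle
        RunPodSandbox(config *PodSandboxConfig) (id string, err error)
        PodSandboxStatus(id string) (*PodSandboxStatus, error)
        StopPodSandbox(id string) error

        // Containers within a sandbox
        CreateContainer(sandboxID string, config *ContainerConfig) (id string, err error)
        StartContainer(containerID string) error
        ListContainers(filter *ContainerFilter) ([]*Container, error)
}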

To create pods, rktlet creates a transient systemd service using systemd-run with the appropriate rkt command line invocation. Subsequent actions like adding and removing containers to and from the pods, respectively, are done by calling the rkt command line tool.
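
For illustration, the flow looks roughly like the following; the unit name, paths and flags here are simplified compared to what rktlet actually generates:

# start the pod sandbox as a transient systemd service
$ sudo systemd-run --unit=rktlet-pod-example \
    rkt app sandbox --uuid-file-save=/var/run/rktlet/pod-example.uuid
# later, manage apps in the running pod via the rkt CLI
$ sudo rkt app add $(cat /var/run/rktlet/pod-example.uuid) docker://nginx
$ sudo rkt app start $(cat /var/run/rktlet/pod-example.uuid) --app=nginx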

The following component diagram provides a visualization of what we’ve described.

To try out rktlet, follow the Getting Started guide.

Driving rkt development

Work on rktlet has spurred a couple new features inside of rkt itself which we’ll take a moment to highlight.

Pod manipulation

rkt has always been pod-native, but the pods themselves were immutable. The original design did not allow for actions such as starting, stopping, or adding apps to a pod. These features were added to rkt in order to be CRI conformant. This work is described in the app level API document.

Logging and attaching

Historically, apps in rkt have offloaded logging to a sidecar service — by default systemd-journald — that multiplexes their output to the outside world. The sidecar service handled logging, while interactive applications reused a parent TTY.

But the CRI defines a logging format that is plaintext whereas systemd-journald’s output format is binary. Moreover, Kubernetes has an attaching feature that couldn’t be implemented with the old design.

To solve these problems, a component called iottymux was implemented. When enabled, it replaces systemd-journald completely, providing app logs in a CRI-compatible format along with the logic needed for the attach feature.

For a more detailed description of this design, check out the log attach design document.

Future work for rktlet

rktlet still needs work before it’s ready for production workloads and 100% CRI compliance. Some of the work that still needs to be done is…

Join the team

If you’d like to join the effort, rktlet offers ample chances to get involved. Ongoing work is discussed in the #sig-node-rkt Kubernetes Slack channel. If you’re at Kubecon North America in Austin, please come by the rkt salon to talk about rkt and rktlet.

Thanks

Thanks to all those that have contributed to rktlet and to CoreOS, Blablacar, CNCF and our team at Kinvolk for supporting its development.

Running Kubernetes on Travis CI with minikube

Running Kubernetes on Travis CI is not easily possible, as most methods of setting up a cluster need to create resources on AWS or another cloud provider. Setting up VMs is not possible either, as Travis CI doesn’t allow nested virtualization. This post explains how to use minikube without additional resources, with a few simple steps.

Our use case

As we are currently working with Chef on a project to integrate Habitat with Kubernetes (the Habitat Operator), we needed a way to run the end-to-end tests on every pull request. Locally we use minikube, a tool to set up a local one-node Kubernetes cluster for development, or, when we need a multi-node cluster, kube-spawn. But for automated CI tests we currently only require a single-node setup. So we decided to use minikube to be able to easily catch any failed tests and debug and reproduce them locally.

Typically minikube requires a virtual machine to set up Kubernetes. One day this tweet was shared in our Slack: it turns out minikube has a not-so-well-documented way of running Kubernetes without virtualization, using localkube, a single binary for Kubernetes that is executed in a Docker container, and Travis CI already has Docker support. There is a warning against running this locally, but since we only use it on Travis CI, in an ephemeral environment, we concluded that this is an acceptable use case.

The setup

So this is what our setup looks like. Following is the example .travis.yml file:

sudo: required

env:
- CHANGE_MINIKUBE_NONE_USER=true

before_script:
- curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/v1.7.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
- curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
- sudo minikube start --vm-driver=none --kubernetes-version=v1.7.0
- minikube update-context
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl get nodes -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1; done

How it works

First, it installs kubectl, which is a requirement of minikube. The need for sudo: required comes from minikube starting processes that must run as root. With the environment variable CHANGE_MINIKUBE_NONE_USER set, minikube will automatically move config files to the appropriate place and adjust the permissions accordingly. When using the none driver, the kubectl config and credentials generated will be owned by root and will appear in the root user’s home directory. The none driver then does the heavy lifting of setting up localkube on the host. Then the kubeconfig is updated with minikube update-context. And lastly we wait for Kubernetes to be up and ready.

Examples

This work is already being used in the Habitat Operator. For a simple live example setup, have a look at this repo. If you have any questions, feel free to ping me on Twitter @LiliCosic.

Follow Kinvolk on Twitter to get notified when new blog posts go live.

Habitat Operator - Running Habitat Services with Kubernetes

For the last few months, we’ve been working with the Habitat team at Chef to make Habitat-packaged applications run well in Kubernetes. The result of this collaboration is the Habitat Operator, a Kubernetes controller used to deploy, configure and manage applications packaged with Habitat inside of Kubernetes. This article will give an overview of that work — particularly the issues to address, solutions to those issues, and future work.

Habitat Overview

For the uninitiated, Habitat is a project designed to address building, deploying and running applications.

Building applications

Applications are built from shell scripts known as “plans” which describe how to build the application, and may optionally include configurations files and lifecycle hooks. From the information in the plan, Habitat can create a package of the application.

Deploy applications

In order to run an application with a container runtime like Docker or rkt, Habitat supports exporting packages to a Docker container image. You can then upload the container image to a registry and use it to deploy applications to a container orchestration system like Kubernetes.

Running applications

Applications packaged with Habitat — hereafter referred to simply as applications — support the following runtime features.

These features are available because all Habitat applications run under a supervisor process called the Supervisor. The Supervisor takes care of restarting, reconfiguring and gracefully terminating services. It also allows multiple instances of applications to run, with the Supervisors communicating with each other via a gossip protocol. These can connect to form a ring and establish Service Groups for sharing configuration data and establishing topologies.

Integration with Kubernetes

Many of the features that Habitat provides overlap with features that are provided in Kubernetes. Where there is overlap, the Habitat Operator tries to translate, or defer, to the Kubernetes-native mechanism. One design goal of the Habitat Operator is to allow Kubernetes users to use the Kubernetes CLI without fear that Habitat applications will become out of sync. For example, update strategies are a core feature of Kubernetes and should be handled by Kubernetes.

For the features that do not overlap — such as topologies and application binding — the Habitat Operator ensures that these work within Kubernetes.

Joining the ring

One of the fundamental challenges we faced when adapting Habitat to Kubernetes was forming and joining a ring. Habitat uses the --peer flag, which is passed the IP address of a previously started Supervisor. But in the Kubernetes world this is not possible, as all pods need to be started with the exact same command line flags. To make this possible within Kubernetes, we implemented a new flag in Habitat itself, --peer-watch-file. This flag takes a file which should contain a list of one or more IP addresses of the peers in the Service Group it would like to join. Habitat uses this information to form the ring between the Supervisors. The Habitat Operator implements this using a Kubernetes ConfigMap which is mounted into each pod.
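
A sketch of what this looks like from the Supervisor’s side (the service, file path and address are examples; in the operator, the file is kept up to date from the mounted ConfigMap):

$ cat /habitat-operator/peer-ip
10.2.1.7
$ hab start core/redis --peer-watch-file /habitat-operator/peer-ip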

Initial Configuration

Habitat allows for drawing configuration information from different sources. One of them is a user.toml file, which is used for initial configuration and is not gossiped within the ring. Because there can be sensitive data in configuration files, we use Kubernetes Secrets for all configuration data. The Habitat Operator mounts the configuration files where Habitat expects to find them, and the application automatically picks up this configuration as it normally would. This mechanism will also be reused to support configuration updates in the future.
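
In practice that means creating a Kubernetes Secret from the user.toml and referencing it from the Habitat service definition. Creating the Secret looks roughly like this (the Secret and file names here are illustrative; see the operator documentation for the exact convention):

$ kubectl create secret generic user-toml --from-file=user.toml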

Topologies

Another Habitat-specific feature the operator supports is the choice between the two topologies supported in Habitat. The standalone topology — the default topology in Habitat — is used for applications that are independent of one another. With the leader/follower topology, the Supervisor handles leader election over the ring before the application starts. For this topology, three or more instances of an application must be available for a successful leader election to take place.

Ring encryption

A security feature of Habitat that we brought into the operator is securing the ring by encrypting all communications across the network.

Application binding

We also added an ability to do runtime binding, meaning that applications form a producer/consumer relationship at application start. The producer exports the configuration and the consumer, through the Supervisor ring, consumes that information. You can learn more about that in the demo below:

Future plans for Habitat operator

The Habitat Operator is in heavy development and we’re excited about the features that we have planned for the next months.

Export to Kubernetes

We’ve already started work on an exporter for Kubernetes. This will allow you to export the application you packaged with Habitat to a Docker image along with a generated manifest file that can be used to deploy directly to Kubernetes.

Dynamic configuration

As mentioned above, we are planning to extend the initial configuration mechanism and use the same logic for configuration updates. This work should be landing in Habitat very soon. With it, configuration changes to Habitat applications can be made without restarting pods. The behaviour for configuration updates is defined in the application’s Habitat plan.

Further Kubernetes integration and demos

We’re also looking into exporting to Helm charts in the near future. This could allow for bringing a large collection of Habitat-packaged applications to Kubernetes.

Another area to explore is integration between the Habitat Builder and Kubernetes. The ability to automatically rebuild applications, export images, and deploy to Kubernetes when dependencies are updated could bring great benefits to Habitat and Kubernetes users alike.

Conclusion

Please take the operator for a spin here. The first release is now available. All you need is an application packaged with Habitat and exported as a Docker image, and that functionality is already in Habitat itself.

Note: The Habitat operator is compatible with Habitat version 0.36.0 onwards. If you have any questions feel free to ask on the #kubernetes channel in Habitat slack or open an issue on the Habitat operator.

Follow Kinvolk on Twitter to get notified when new blog posts go live.

An update on gobpf - ELF loading, uprobes, more program types

Gophers by Ashley McNamara, Ponies by Deirdré Straughan - CC BY-NC-SA 4.0

Almost a year ago we introduced gobpf, a Go library to load and use eBPF programs from Go applications. Today we would like to give you a quick update on the changes and features added since then (i.e. the highlights of git log --oneline --no-merges --since="November 30th 2016" master).

Load BPF programs from ELF object files

With commit 869e637, gobpf was split into two subpackages (github.com/iovisor/gobpf/bcc and github.com/iovisor/gobpf/elf) and learned to load BPF programs from ELF object files. This allows users to pre-build their programs with clang/LLVM and its BPF backend as an alternative to using the BPF Compiler Collection.

One project where we at Kinvolk used pre-built ELF objects is the TCP tracer that we wrote for Weave Scope. Putting the program into the library allows us to go get and vendor the tracer as any other Go dependency.

Another important result of using the ELF loading mechanism is that the Scope container images are much smaller, as bcc and clang are not included and don’t add to the container image size.

Let’s see how this is done in practice by building a demo program to log open(2) syscalls to the ftrace trace_pipe:

// program.c

#include <linux/kconfig.h>
#include <linux/bpf.h>

#include <uapi/linux/ptrace.h>

// definitions of bpf helper functions we need, as found in
// http://elixir.free-electrons.com/linux/latest/source/samples/bpf/bpf_helpers.h

#define SEC(NAME) __attribute__((section(NAME), used))

#define PT_REGS_PARM1(x) ((x)->di)

static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
        (void *) BPF_FUNC_probe_read;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
        (void *) BPF_FUNC_trace_printk;

#define printt(fmt, ...)                                                   \
        ({                                                                 \
                char ____fmt[] = fmt;                                      \
                bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
        })

// the kprobe

SEC("kprobe/SyS_open")
int kprobe__sys_open(struct pt_regs *ctx)
{
        char filename[256];

        bpf_probe_read(filename, sizeof(filename), (void *)PT_REGS_PARM1(ctx));

        printt("open(%s)\n", filename);

        return 0;
}

char _license[] SEC("license") = "GPL";
// this number will be interpreted by the elf loader
// to set the current running kernel version
__u32 _version SEC("version") = 0xFFFFFFFE;

On a Debian system, the corresponding Makefile could look like this:

# Makefile
# …

uname=$(shell uname -r)

build-elf:
        clang \
                -D__KERNEL__ \
                -O2 -emit-llvm -c program.c \
                -I /lib/modules/$(uname)/source/include \
                -I /lib/modules/$(uname)/source/arch/x86/include \
                -I /lib/modules/$(uname)/build/include \
                -I /lib/modules/$(uname)/build/arch/x86/include/generated \
                -o - | \
                llc -march=bpf -filetype=obj -o program.o

A small Go tool can then be used to load the object file and enable the kprobe with the help of gobpf:

// main.go

package main

import (
        "fmt"
        "os"
        "os/signal"

        "github.com/iovisor/gobpf/elf"
)

func main() {
        module := elf.NewModule("./program.o")
        if err := module.Load(nil); err != nil {
                fmt.Fprintf(os.Stderr, "Failed to load program: %v\n", err)
                os.Exit(1)
        }
        defer func() {
                if err := module.Close(); err != nil {
                        fmt.Fprintf(os.Stderr, "Failed to close program: %v", err)
                }
        }()

        if err := module.EnableKprobe("kprobe/SyS_open", 0); err != nil {
                fmt.Fprintf(os.Stderr, "Failed to enable kprobe: %v\n", err)
                os.Exit(1)
        }

        sig := make(chan os.Signal, 1)
        signal.Notify(sig, os.Interrupt, os.Kill)

        <-sig
}

Now every time a process uses open(2), the kprobe will log a message. Messages written with bpf_trace_printk can be seen in the trace_pipe “live trace”:

sudo cat /sys/kernel/debug/tracing/trace_pipe

With go-bindata it’s possible to bundle the compiled BPF program into the Go binary to build a single fat binary that can be shipped and installed conveniently.
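
For example (the package and output file names are up to you):

$ go-bindata -pkg main -o bpf_bindata.go program.o

The generated Go file exposes the bytes of program.o, which can then be handed to the ELF loader at runtime.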

Trace user-level functions with bcc and uprobes

Louis McCormack contributed support for uprobes in github.com/iovisor/gobpf/bcc and therefore it is now possible to trace user-level function calls. For example, to trace all readline() function calls from /bin/bash processes, you can run the bash_readline.go demo:

sudo -E go run ./examples/bcc/bash_readline/bash_readline.go

More supported program types for gobpf/elf

gobpf/elf learned to load programs of type TRACEPOINT, SOCKET_FILTER, CGROUP_SOCK and CGROUP_SKB:

Tracepoints

A program of type TRACEPOINT can be attached to any Linux tracepoint. Tracepoints in Linux are “a hook to call a function (probe) that you can provide at runtime.” A list of available tracepoints can be obtained with find /sys/kernel/debug/tracing/events -type d.
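
Each tracepoint also describes the format of the data it passes to the attached program, which can be inspected in the same directory tree. For example:

sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_open/format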

Socket filtering

Socket filtering is the mechanism used by tcpdump to retrieve packets matching an expression. With SOCKET_FILTER programs, we can filter data on a socket by attaching them with setsockopt(2).
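
For illustration, here is a minimal Go sketch of that attach step, assuming you already have the file descriptor of a loaded SOCKET_FILTER program (for example from gobpf/elf). The -1 placeholder makes this sketch fail at runtime until a real fd is supplied:

package main

import (
        "log"

        "golang.org/x/sys/unix"
)

// attachFilter attaches an already loaded SOCKET_FILTER program,
// identified by its file descriptor, to a socket via setsockopt(2).
func attachFilter(sockFd, progFd int) error {
        return unix.SetsockoptInt(sockFd, unix.SOL_SOCKET, unix.SO_ATTACH_BPF, progFd)
}

func main() {
        progFd := -1 // placeholder: obtain this from your BPF loader

        sockFd, err := unix.Socket(unix.AF_INET, unix.SOCK_DGRAM, unix.IPPROTO_UDP)
        if err != nil {
                log.Fatalf("socket: %v", err)
        }
        defer unix.Close(sockFd)

        if err := attachFilter(sockFd, progFd); err != nil {
                log.Fatalf("failed to attach filter: %v", err)
        }
}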

cgroups

CGROUP_SOCK and CGROUP_SKB can be used to load and use programs specific to a cgroup. CGROUP_SOCK programs “run any time a process in the cgroup opens an AF_INET or AF_INET6 socket” and can be used to enable socket modifications. CGROUP_SKB programs are similar to SOCKET_FILTER and are executed for each network packet with the purpose of cgroup specific network filtering and accounting.

Continuous integration

We have set up continuous integration and written about how we use custom rkt stage1 images to test against various kernel versions. At the time of writing, gobpf has elementary tests to verify that programs and their sections can be loaded on kernel versions 4.4, 4.9 and 4.10, but no thorough testing of all functionality and features yet (e.g. perf map polling).

Miscellaneous

Thanks to contributors and clients

In closing, we’d like to thank all those who have contributed to gobpf. We look forward to merging more commits from contributors and seeing how others make use of gobpf.

A special thanks goes to Weaveworks for funding the work from which gobpf was born. Continued contributions have been possible through other clients, for whom we are helping build products (WIP) that leverage gobpf.

Introducing kube-spawn: a tool to create local, multi-node Kubernetes clusters

kube-spawn is a tool to easily start a local, multi-node Kubernetes cluster on a Linux machine. While its original audience was mainly developers of Kubernetes, it’s turned into a tool that is great for just trying Kubernetes out and exploring. This article will give a general introduction to kube-spawn and show how to use it.

Overview

kube-spawn aims to become the easiest means of testing and fiddling with Kubernetes on Linux. We started the project because it is still rather painful to start a multi-node Kubernetes cluster on our development machines. And the tools that do provide this functionality generally do not reflect the environment that Kubernetes will eventually be running on: a full Linux OS.

Running a Kubernetes cluster with kube-spawn

So, without further ado, let’s start our cluster. With one command kube-spawn fetches the Container Linux image, prepares the nodes, and deploys the cluster. Note that you can also do these steps individually with machinectl pull-raw, and the kube-spawn setup and init subcommands. But the up subcommand does this all for us.

$ sudo GOPATH=$GOPATH CNI_PATH=$GOPATH/bin ./kube-spawn up --nodes=3

When that command completes, you’ll have a 3-node Kubernetes cluster. You’ll need to wait for the nodes to be ready before it’s useful.

$ export KUBECONFIG=$GOPATH/src/github.com/kinvolk/kube-spawn/.kube-spawn/default/kubeconfig
$ kubectl get nodes
NAME           STATUS    AGE       VERSION
kube-spawn-0   Ready     1m        v1.7.0
kube-spawn-1   Ready     1m        v1.7.0
kube-spawn-2   Ready     1m        v1.7.0

Looks like all the nodes are ready. Let’s move on.

The demo app

In order to test that our cluster is working, we’re going to deploy the microservices demo, Sock Shop, from our friends at Weaveworks. The Sock Shop is a complex microservices app that uses many components commonly found in real-world deployments. So it’s good for testing that everything is working, and it gives us something more substantial to explore than a hello world app.

Cloning the demo app

To proceed, you’ll need to clone the microservices-demo repo and navigate to the deploy/kubernetes folder.

$ cd ~/repos
$ git clone https://github.com/microservices-demo/microservices-demo.git sock-shop
$ cd sock-shop/deploy/kubernetes/

Deploying the demo app

Now that we have things in place, let’s deploy. We first need to create the sock-shop namespace that the deployment expects.

$ kubectl create namespace sock-shop
namespace "sock-shop" created

With that, we’ve got all we need to deploy the app:

$ kubectl create -f complete-demo.yaml
deployment "carts-db" created
service "carts-db" created
deployment "carts" created
service "carts" created
deployment "catalogue-db" created
service "catalogue-db" created
deployment "catalogue" created
service "catalogue" created
deployment "front-end" created
service "front-end" created
deployment "orders-db" created
service "orders-db" created
deployment "orders" created
service "orders" created
deployment "payment" created
service "payment" created
deployment "queue-master" created
service "queue-master" created
deployment "rabbitmq" created
service "rabbitmq" created
deployment "shipping" created
service "shipping" created
deployment "user-db" created
service "user-db" created
deployment "user" created
service "user" created

Once that completes, we still need to wait for all the pods to come up.

$ watch kubectl -n sock-shop get pods
NAME                            READY     STATUS    RESTARTS   AGE
carts-2469883122-nd0g1          1/1       Running   0          1m
carts-db-1721187500-392vt       1/1       Running   0          1m
catalogue-4293036822-d79cm      1/1       Running   0          1m
catalogue-db-1846494424-njq7h   1/1       Running   0          1m
front-end-2337481689-v8m2h      1/1       Running   0          1m
orders-733484335-mg0lh          1/1       Running   0          1m
orders-db-3728196820-9v07l      1/1       Running   0          1m
payment-3050936124-rgvjj        1/1       Running   0          1m
queue-master-2067646375-7xx9x   1/1       Running   0          1m
rabbitmq-241640118-8htht        1/1       Running   0          1m
shipping-2463450563-n47k7       1/1       Running   0          1m
user-1574605338-p1djk           1/1       Running   0          1m
user-db-3152184577-c8r1f        1/1       Running   0          1m

Accessing the sock shop

When they’re all ready, we have to find out which port and IP address we use to access the shop. For the port, let’s see which port the front-end service is using.

$ kubectl -n sock-shop get svc
NAME           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
carts          10.110.14.144    <none>        80/TCP         3m
carts-db       10.104.115.89    <none>        27017/TCP      3m
catalogue      10.110.157.8     <none>        80/TCP         3m
catalogue-db   10.99.103.79     <none>        3306/TCP       3m
front-end      10.105.224.192   <nodes>       80:30001/TCP   3m
orders         10.101.177.247   <none>        80/TCP         3m
orders-db      10.109.209.178   <none>        27017/TCP      3m
payment        10.107.53.203    <none>        80/TCP         3m
queue-master   10.111.63.76     <none>        80/TCP         3m
rabbitmq       10.110.136.97    <none>        5672/TCP       3m
shipping       10.96.117.56     <none>        80/TCP         3m
user           10.101.85.39     <none>        80/TCP         3m
user-db        10.107.82.6      <none>        27017/TCP      3m

Here we see that the front-end service is exposed on port 30001 via the node IPs. This means that we can access it using any worker node’s IP address on port 30001. machinectl gives us each node’s IP address.

$ machinectl
MACHINE      CLASS     SERVICE        OS     VERSION  ADDRESSES
kube-spawn-0 container systemd-nspawn coreos 1492.1.0 10.22.0.137...
kube-spawn-1 container systemd-nspawn coreos 1492.1.0 10.22.0.138...
kube-spawn-2 container systemd-nspawn coreos 1492.1.0 10.22.0.139...

Remember, the first node is the master node and all the others are worker nodes. So in our case, we can open our browser to 10.22.0.138:30001 or 10.22.0.139:30001 and should be greeted by a shop selling socks.

Stopping the cluster

Once you’re done with your sock purchases, you can stop the cluster.

$ sudo ./kube-spawn stop
2017/08/10 01:58:00 turning off machines [kube-spawn-0 kube-spawn-1 kube-spawn-2]...
2017/08/10 01:58:00 All nodes are stopped.

A guided demo

If you’d like a more guided tour, you’ll find it here.

As mentioned in the video, kube-spawn creates a .kube-spawn directory in the current directory, where you’ll find several files and directories under the default directory. In order not to be constrained by the size of each OS container, we mount each node’s /var/lib/docker directory here. In this way, we can make use of the host’s disk space. Also, we don’t currently have a clean command, so you can run rm -rf .kube-spawn/ if you want to completely clean things up.

Conclusion

We hope you find kube-spawn as useful as we do. For us, it’s the easiest way to test changes to Kubernetes or spin up a cluster to explore Kubernetes.

There are still lots of improvements (some very obvious) that can be made. PRs are very much welcome!

All Systems Go! - The Userspace Linux Conference

At Kinvolk we spend a lot of time working on and talking about the Linux userspace. We can regularly be found presenting our work at various events and discussing the details of our work with those who are interested. These events are usually either very generally about open source, or focused on a very specific technology, like containers, systemd, or ebpf. While these events are often awesome, and absolutely essential, they simply have a focus that is either too broad, or too specific.

What we felt was missing was an event focused on the Linux userspace itself, and less on the projects and products built on top, or the kernel below. This is the focus of All Systems Go! and why we are excited to be a part of it.

All Systems Go! is designed to be a community event. Tickets to All Systems Go! are affordable — starting at less than 30 EUR — and the event takes place during the weekend, making it more accessible to hobbyists and students. It’s also conveniently scheduled to fall between DockerCon EU in Copenhagen and Open Source Summit in Prague.

Speakers

To make All Systems Go! work, we’ve got to make sure that we get the people to attend who are working at this layer of the system, and on the individual projects that make up the Linux userspace. As a start, we’ve invited a first round of speakers, who also happen to be the CFP selection committee. We’re very happy to welcome to the All Systems Go! team…

While we’re happy to have this initial group of speakers, what’s really going to make All Systems Go! awesome are all the others in the community who submit their proposals and offer their perspectives and voices.

Sponsorship

Sponsors are crucial to open source community events. All Systems Go! is no different. In fact, sponsors are essential to keeping All Systems Go! an affordable and accessible event.

We will soon be announcing our first round of sponsors. If your organization would like to be amongst that group please have a look at our sponsorship prospectus and get in touch.

See you there!

We’re looking forward to welcoming the Linux userspace community to Berlin. Hope to see you there!

Using custom rkt stage1 images to test against various kernel versions

Introduction

When writing software that is tightly coupled with the Linux kernel, it is necessary to test on multiple versions of the kernel. This is relatively easy to do locally with VMs, but when working on open-source code hosted on Github, one wants to make these tests a part of the project’s continuous integration (CI) system to ensure that each pull request runs the tests and passes before merging.

Most CI systems run tests inside containers and, very sensibly, use various security mechanisms to restrict what the code being tested can access. While this does not cause problems for most use cases, it does for us. It blocks certain syscalls that are needed to, say, test a container runtime like rkt, or load eBPF programs into the kernel for testing, as we need to do to test gobpf and tcptracer-bpf. It also doesn’t allow us to run virtual machines, which we need in order to run tests on different versions of the kernel.

Finding a continuous integration service

While working on the rkt project, we did a survey of CI systems to find the ones that we could use to test rkt itself. Because of the above-stated requirements, it was clear that we needed one that gave us the option to run tests inside a virtual machine. This makes the list rather small; in fact, we were left with only SemaphoreCI.

SemaphoreCI supports running Docker inside of the test environment. This is possible because the test environment they provide for this is simply a VM. For rkt, this allowed us to run automatic tests for the container runtime each time a PR was submitted and/or changed.

However, it doesn’t solve the problem of testing on various kernels and kernel configurations as we want for gobpf and tcptracer-bpf. Luckily, this is where rkt and its KVM stage1 come to the rescue.

Our solution

To continuously test the work we are doing on Weave Scope, tcptracer-bpf and gobpf, we not only need a relatively new Linux kernel, but also require a subset of features like CONFIG_BPF=y or CONFIG_HAVE_KPROBES=y to be enabled.

With rkt’s KVM stage1 we can run our software in a virtual machine and, thanks to rkt’s modular architecture, build and use a custom stage1 suited to our needs. This allows us to run our tests on any platform that allows rkt to run; in our case, Semaphore CI.

Building a custom rkt stage1 for KVM

Our current approach relies on App Container Image (ACI) dependencies. All of our custom stage1 images are based on rkt’s coreos.com/rkt/stage1-kvm. In this way, we can apply changes to particular components (e.g. the Linux kernel) while reusing the other parts of the upstream stage1 image.

An ACI manifest template for such an image could look like the following.

{
        "acKind": "ImageManifest",
        "acVersion": "0.8.9",
        "name": "kinvolk.io/rkt/stage1-kvm-linux-{{kernel_version}}",
        "labels": [
                {
                        "name": "arch",
                        "value": "amd64"
                },
                {
                        "name": "os",
                        "value": "linux"
                },
                {
                        "name": "version",
                        "value": "0.1.0"
                }
        ],
        "annotations": [
                {
                        "name": "coreos.com/rkt/stage1/run",
                        "value": "/init"
                },
                {
                        "name": "coreos.com/rkt/stage1/enter",
                        "value": "/enter_kvm"
                },
                {
                        "name": "coreos.com/rkt/stage1/gc",
                        "value": "/gc"
                },
                {
                        "name": "coreos.com/rkt/stage1/stop",
                        "value": "/stop_kvm"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/add",
                        "value": "/app-add"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/rm",
                        "value": "/app-rm"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/start",
                        "value": "/app-start"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/stop",
                        "value": "/app-stop"
                },
                {
                        "name": "coreos.com/rkt/stage1/interface-version",
                        "value": "5"
                }
        ],
        "dependencies": [
                {
                        "imageName": "coreos.com/rkt/stage1-kvm",
                        "labels": [
                                {
                                        "name": "os",
                                        "value": "linux"
                                },
                                {
                                        "name": "arch",
                                        "value": "amd64"
                                },
                                {
                                        "name": "version",
                                        "value": "1.23.0"
                                }
                        ]
                }
        ]
}

Note: rkt doesn’t automatically fetch stage1 dependencies and we have to pre-fetch those manually.

To build a kernel (arch/x86/boot/bzImage), we use make bzImage after applying a single patch to the source tree. Without the patch, the kernel would block and not return control to rkt.

# change directory to kernel source tree
curl -LsS https://raw.githubusercontent.com/coreos/rkt/v1.23.0/stage1/usr_from_kvm/kernel/patches/0001-reboot.patch -O
patch --silent -p1 < 0001-reboot.patch
# configure kernel
make bzImage

We now can combine the ACI manifest with a root filesystem holding our custom built kernel, for example:

aci/4.9.4/
├── manifest
└── rootfs
    └── bzImage

We are now ready to build the stage1 ACI with actool:

actool build --overwrite aci/4.9.4 my-custom-stage1-kvm.aci

Run rkt with a custom stage1 for KVM

rkt offers multiple command line flags for specifying a stage1; we use --stage1-path. To smoke test our newly built stage1, we run a Debian Docker container and call uname -r to make sure our custom built kernel is actually used:

rkt image fetch coreos.com/rkt/stage1-kvm:1.23.0 # due to rkt issue #2241
rkt run \
  --insecure-options=image \
  --stage1-path=./my-custom-stage1-kvm.aci \
  docker://debian --exec=/bin/uname -- -r
4.9.4-kinvolk-v1
[...]

We set CONFIG_LOCALVERSION="-kinvolk-v1" in the kernel config and the version is correctly shown as 4.9.4-kinvolk-v1.

Run on Semaphore CI

Semaphore does not include rkt by default on their platform. Hence, we have to download rkt in semaphore.sh as a first step:

#!/bin/bash

readonly rkt_version="1.23.0"

if [[ ! -f "./rkt/rkt" ]] || \
  [[ ! "$(./rkt/rkt version | awk '/rkt Version/{print $3}')" == "${rkt_version}" ]]; then

  curl -LsS "https://github.com/coreos/rkt/releases/download/v${rkt_version}/rkt-v${rkt_version}.tar.gz" \
    -o rkt.tgz

  mkdir -p rkt
  tar -xvf rkt.tgz -C rkt --strip-components=1
fi

[...]

After that we can pre-fetch the stage1 image we depend on and then run our tests. Note that we now use ./rkt/rkt. And we use timeout to make sure our tests fail if they cannot be finished in a reasonable amount of time.

Example:

sudo ./rkt/rkt image fetch --insecure-options=image coreos.com/rkt/stage1-kvm:1.23.0
sudo timeout --foreground --kill-after=10 5m \
  ./rkt/rkt \
  --uuid-file-save=./rkt-uuid \
  --insecure-options=image,all-run \
  --stage1-path=./rkt/my-custom-stage1-kvm.aci \
  ...
  --exec=/bin/sh -- -c \
  'cd /go/... ; \
    go test -v ./...'

--uuid-file-save=./rkt-uuid is required so that semaphore.sh can determine the UUID of the started container and read its exit status (since it is not propagated on the KVM stage1) after the test has finished, exiting accordingly:

[...]

test_status=$(sudo ./rkt/rkt status $(<rkt-uuid) | awk '/app-/{split($0,a,"=")} END{print a[2]}')
exit $test_status

Bind mount directories from stage1 into stage2

If you want to provide data to stage2 from stage1 you can do this with a small systemd drop-in unit to bind mount the directories. This allows you to add or modify content without actually touching the stage2 root filesystem.

We did the following to provide the Linux kernel headers to stage2:

# add systemd drop-in to bind mount kernel headers
mkdir -p "${rootfs_dir}/etc/systemd/system/[email protected]"
cat <<EOF >"${rootfs_dir}/etc/systemd/system/[email protected]/10-bind-mount-kernel-header.conf"
[Service]
ExecStartPost=/usr/bin/mkdir -p %I/${kernel_header_dir}
ExecStartPost=/usr/bin/mount --bind "${kernel_header_dir}" %I/${kernel_header_dir}
EOF

Note: for this to work you need to have mkdir in stage1, which is not included in the default rkt stage1-kvm. We use the one from busybox: https://busybox.net/downloads/binaries/1.26.2-i686/busybox_MKDIR

Automating the steps

We want to be able to do this for many kernel versions. Thus, we have created a tool, stage1-builder, that does most of this for us. With stage1-builder you simply need to add the kernel configuration to the config directory and run the ./builder script. The result is an ACI file containing our custom kernel with a dependency on the upstream kvm-stage1.
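
A typical session might look like this (the kernel config name is an example; see the stage1-builder README for the exact workflow):

$ git clone https://github.com/kinvolk/stage1-builder
$ cd stage1-builder
# add your kernel config, e.g. config/linux-4.9.4.config, then:
$ ./builder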

Conclusion

With SemaphoreCI providing us with a proper VM and rkt’s modular stage1 architecture, we have put together a CI pipeline that allows us to test gobpf and tcptracer-bpf on various kernels. In our opinion this setup is much preferable to the alternative, setting up and maintaining Jenkins.

It is interesting to point out that we did not have to use or make changes to rkt’s build system. Leveraging ACI dependencies was all we needed to swap out the KVM stage1 kernel. For the simple case of testing software on various kernel versions, rkt’s modular design has proven to be very useful.

Kinvolk Presenting at FOSDEM 2017

The same procedure as last year, Miss Sophie?
The same procedure as every year, James!

As with every year, we’ve reserved the first weekend of February to attend FOSDEM, the premier open-source conference in Europe. We’re looking forward to having drinks and chatting with other open-source contributors and enthusiasts.

But it’s not all fun and games for us. The Kinvolk team has three talks; one each in the Go, Testing and Automation, & Linux Containers and Microservices devrooms.

The talks

We look forward to sharing our work and having conversations about the following topics…

If you’ll be there and are interested in those, or other projects we work on, please do track us down.

We look forward to seeing you there!

Introducing gobpf - Using eBPF from Go

Gopher by Takuya Ueda - CC BY 3.0

What is eBPF?

eBPF is a “bytecode virtual machine” in the Linux kernel that is used for tracing kernel functions, networking, performance analysis and more. Its roots lie in the Berkeley Packet Filter (sometimes called LSF, Linux Socket Filtering), but as it supports more operations (e.g. BPF_CALL 0x80 /* eBPF only: function call */) and nowadays has much broader use than packet filtering on a socket, it’s called extended BPF.

With the addition of the dedicated bpf() syscall in Linux 3.18, it became easier to perform the various eBPF operations. Further, the BPF compiler collection from the IO Visor Project and its libbpf provide a rich set of helper functions as well as Python bindings that make it more convenient to write eBPF powered tools.

To get an idea of how eBPF looks, let’s take a peek at struct bpf_insn prog[] - a list of instructions in pseudo-assembly. Below we have a simple user-space C program to count the number of fchownat(2) calls. We use bpf_prog_load from libbpf to load the eBPF instructions as a kprobe and use bpf_attach_kprobe to attach it to the syscall. Now each time fchownat is called, the kernel executes the eBPF program. The program loads the map (more about maps later), increments the counter and exits. In the C program, we read the value from the map and print it every second.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <linux/version.h>

#include <bcc/bpf_common.h>
#include <bcc/libbpf.h>

int main() {
	int map_fd, prog_fd, key=0, ret;
	long long value = 0; /* initialize the counter to zero */
	char log_buf[8192];
	void *kprobe;

	/* Map size is 1 since we store only one value, the chown count */
	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1);
	if (map_fd < 0) {
		fprintf(stderr, "failed to create map: %s (ret %d)\n", strerror(errno), map_fd);
		return 1;
	}

	ret = bpf_update_elem(map_fd, &key, &value, 0);
	if (ret != 0) {
		fprintf(stderr, "failed to initialize map: %s (ret %d)\n", strerror(errno), ret);
		return 1;
	}

	struct bpf_insn prog[] = {
		/* Put 0 (the map key) on the stack */
		BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
		/* Put frame pointer into R2 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		/* Decrement pointer by four */
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
		/* Put map_fd into R1 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		/* Load current count from map into R0 */
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
			     BPF_FUNC_map_lookup_elem),
		/* If returned value NULL, skip two instructions and return */
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		/* Put 1 into R1 */
		BPF_MOV64_IMM(BPF_REG_1, 1),
		/* Increment value by 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
		/* Return from program */
		BPF_EXIT_INSN(),
	};

	prog_fd = bpf_prog_load(BPF_PROG_TYPE_KPROBE, prog, sizeof(prog), "GPL", LINUX_VERSION_CODE, log_buf, sizeof(log_buf));
	if (prog_fd < 0) {
		fprintf(stderr, "failed to load prog: %s (ret %d)\ngot CAP_SYS_ADMIN?\n%s\n", strerror(errno), prog_fd, log_buf);
		return 1;
	}

	kprobe = bpf_attach_kprobe(prog_fd, "p_sys_fchownat", "p:kprobes/p_sys_fchownat sys_fchownat", -1, 0, -1, NULL, NULL);
	if (kprobe == NULL) {
		fprintf(stderr, "failed to attach kprobe: %s\n", strerror(errno));
		return 1;
	}

	for (;;) {
		ret = bpf_lookup_elem(map_fd, &key, &value);
		if (ret != 0) {
			fprintf(stderr, "failed to lookup element: %s (ret %d)\n", strerror(errno), ret);
		} else {
			printf("fchownat(2) count: %lld\n", value);
		}
		sleep(1);
	}

	return 0;
}

The example requires libbcc and can be compiled with:

gcc -I/usr/include/bcc/compat main.c -o chowncount -lbcc

Nota bene: the increment in the example code is not atomic. In real code, we would have to use one map per CPU and aggregate the result.

It is important to know that eBPF programs run directly in the kernel and that their invocation depends on the type. They are executed without change of context. As we have seen above, kprobes for example are triggered whenever the kernel executes a specified function.

Thanks to clang and LLVM, it’s not necessary to actually write plain eBPF instructions. Modules can be written in C and use functions provided by libbpf (as we will see in the gobpf example below).

eBPF Program Types

The type of an eBPF program defines properties like the kernel helper functions available to the program or the input it receives from the kernel. Linux 4.8 knows the following program types:

// https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/bpf.h#L90-L98
enum bpf_prog_type {
	BPF_PROG_TYPE_UNSPEC,
	BPF_PROG_TYPE_SOCKET_FILTER,
	BPF_PROG_TYPE_KPROBE,
	BPF_PROG_TYPE_SCHED_CLS,
	BPF_PROG_TYPE_SCHED_ACT,
	BPF_PROG_TYPE_TRACEPOINT,
	BPF_PROG_TYPE_XDP,
};

A program of type BPF_PROG_TYPE_SOCKET_FILTER, for instance, receives a struct __sk_buff * as its first argument whereas it’s struct pt_regs * for programs of type BPF_PROG_TYPE_KPROBE.

eBPF Maps

Maps are a “generic data structure for storage of different types of data” and can be used to share data between eBPF programs as well as between kernel and userspace. The key and value of a map can be of arbitrary size as defined when creating the map. The user also defines the maximum number of entries (max_entries). Linux 4.8 knows the following map types:

// https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/bpf.h#L78-L88
enum bpf_map_type {
	BPF_MAP_TYPE_UNSPEC,
	BPF_MAP_TYPE_HASH,
	BPF_MAP_TYPE_ARRAY,
	BPF_MAP_TYPE_PROG_ARRAY,
	BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	BPF_MAP_TYPE_PERCPU_HASH,
	BPF_MAP_TYPE_PERCPU_ARRAY,
	BPF_MAP_TYPE_STACK_TRACE,
	BPF_MAP_TYPE_CGROUP_ARRAY,
};

While BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY are generic maps for different types of data, BPF_MAP_TYPE_PROG_ARRAY is a special purpose array map. It holds file descriptors referring to other eBPF programs and can be used by an eBPF program to “replace its own program flow with the one from the program at the given program array slot”. The BPF_MAP_TYPE_PERF_EVENT_ARRAY map is for storing data of type struct perf_event in a ring buffer.

In the example above we used a map of type hash with a size of 1 to hold the call counter.

gobpf

In the context of the work we are doing on Weave Scope for Weaveworks, we have been working extensively with both eBPF and Go. As Scope is written in Go, it makes sense to use eBPF directly from Go.

In looking at how to do this, we stumbled upon some code in the IO Visor Project that looked like a good starting point. After talking to the folks at the project, we decided to move this out into a dedicated repository: https://github.com/iovisor/gobpf. gobpf is a Go library that leverages the bcc project to make working with eBPF programs from Go simple.

To get an idea of how this works, the following example chrootsnoop shows how to use a bpf.PerfMap to monitor chroot(2) calls:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"
	"os/signal"
	"unsafe"

	"github.com/iovisor/gobpf"
)

import "C"

const source string = `
#include <uapi/linux/ptrace.h>
#include <bcc/proto.h>

typedef struct {
	u32 pid;
	char comm[128];
	char filename[128];
} chroot_event_t;

BPF_PERF_OUTPUT(chroot_events);

int kprobe__sys_chroot(struct pt_regs *ctx, const char *filename)
{
	u64 pid = bpf_get_current_pid_tgid();
	chroot_event_t event = {
		.pid = pid >> 32,
	};
	bpf_get_current_comm(&event.comm, sizeof(event.comm));
	bpf_probe_read(&event.filename, sizeof(event.filename), (void *)filename);
	chroot_events.perf_submit(ctx, &event, sizeof(event));
	return 0;
}
`

type chrootEvent struct {
	Pid      uint32
	Comm     [128]byte
	Filename [128]byte
}

func main() {
	m := bpf.NewBpfModule(source, []string{})
	defer m.Close()

	chrootKprobe, err := m.LoadKprobe("kprobe__sys_chroot")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to load kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	err = m.AttachKprobe("sys_chroot", chrootKprobe)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to attach kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	chrootEventsTable := bpf.NewBpfTable(0, m)

	chrootEventsChannel := make(chan []byte)

	chrootPerfMap, err := bpf.InitPerfMap(chrootEventsTable, chrootEventsChannel)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to init perf map: %s\n", err)
		os.Exit(1)
	}

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, os.Kill)

	go func() {
		var chrootE chrootEvent
		for {
			data := <-chrootEventsChannel
			err := binary.Read(bytes.NewBuffer(data), binary.LittleEndian, &chrootE)
			if err != nil {
				fmt.Fprintf(os.Stderr, "Failed to decode received chroot event data: %s\n", err)
				continue
			}
			comm := (*C.char)(unsafe.Pointer(&chrootE.Comm))
			filename := (*C.char)(unsafe.Pointer(&chrootE.Filename))
			fmt.Printf("pid %d %s called chroot(2) on %s\n", chrootE.Pid, C.GoString(comm), C.GoString(filename))
		}
	}()

	chrootPerfMap.Start()
	<-sig
	chrootPerfMap.Stop()
}

You will notice that our eBPF program is written in C for this example. The bcc project uses clang to convert the code to eBPF instructions.

We don’t have to interact with libbpf directly from our Go code, as gobpf implements a callback and makes sure we receive the data from our eBPF program through the chrootEventsChannel.

To test the example, you can run it with sudo -E go run chrootsnoop.go and, for instance, execute any systemd unit with a RootDirectory statement. A simple chroot ... also works, of course.

# hello.service
[Unit]
Description=hello service

[Service]
RootDirectory=/tmp/chroot
ExecStart=/hello

[Install]
WantedBy=default.target

You should see output like:

pid 7857 (hello) called chroot(2) on /tmp/chroot

Conclusion

With its growing capabilities, eBPF has become an indispensable tool for modern Linux system software. gobpf helps you to conveniently use libbpf functionality from Go.

gobpf is in a very early stage, but usable. Input and contributions are very much welcome.

If you want to learn more about our use of eBPF in software like Weave Scope, stay tuned and have a look at our work on GitHub: https://github.com/kinvolk

Follow Kinvolk on Twitter to get notified when new blog posts go live.

Testing web services with traffic control on Kubernetes

This is part 2 of our “testing applications with traffic control series”. See part 1, testing degraded network scenarios with rkt, for detailed information about how traffic control works on Linux.

In this installment we demonstrate how to test web services with traffic control on Kubernetes. We introduce tcd, a simple traffic control daemon developed by Kinvolk for this demo. Our demonstration system runs on OpenShift 3, Red Hat’s container platform based on Kubernetes, and uses the excellent Weave Scope, an interactive container monitoring and visualization tool.

We’ll be giving a live demonstration of this at the OpenShift Commons Briefing on May 26th, 2016. Please join us there.

The premise

As discussed in part 1 of this series, tests generally run under optimal networking conditions. This means that standard testing procedures neglect a whole bevy of issues that can arise due to poor network conditions.

Would it not be prudent to also test that your services perform satisfactorily when there is, for example, high packet loss, high latency, a slow rate of transmission, or a combination of those? We think so, and if you do too, please read on.

Traffic control on a distributed system

Let’s now make things more concrete by using tcd in our Kubernetes cluster.

The setup

To get started, we need to start an OpenShift-ready VM to provide us with our Kubernetes cluster. We’ll then create an OpenShift project and do some configuration.

If you want to follow along, you can go to our demo repository which will guide you through installing and setting up things.

The pieces

Before diving into the traffic control demo, we want to give you a really quick overview of tcd, OpenShift and Weave Scope.

tcd (traffic control daemon)

tcd is a simple daemon that runs on each Kubernetes node and responds to API calls. tcd manipulates the traffic control settings of the pods using the tc command which we briefly mentioned in part 1. It’s decoupled from the service being tested, meaning you can stop and restart the daemon on a pod without affecting its connectivity.

In this demo, it receives commands from buttons exposed in Weave Scope.

OpenShift

OpenShift is Red Hat’s container platform that makes it simple to build, deploy, manage and secure containerized applications at scale on any cloud infrastructure, including Red Hat’s own hosted offering, OpenShift Dedicated. Version 3 of OpenShift uses Kubernetes under the hood to maintain cluster health and easily scale services.

In the following figure, you see an example of the OpenShift dashboard with the running pods.

Here we have 1 Weave Scope app pod, 3 ping test pods, and 1 tcd pod. Using the arrow buttons one can scale the application up and down, and the circle changes color depending on the status of the application (e.g. scaling, terminating, etc.).

Weave Scope

Weave Scope helps to intuitively understand, monitor, and control containerized applications. It visually represents pods and processes running on Kubernetes and allows one to drill into pods, showing information such as CPU & memory usage, running processes, etc. One can also stop, start, and interact with containerized applications directly through its UI.

While this graphic shows Weave Scope displaying containers, we see at the top that we can also display information about processes and hosts.

How the pieces fit together

Now that we understand the individual pieces, let’s see how it all works together. Below is a diagram of our demo system.

Here we have 2 Kubernetes nodes each running one instance of the tcd daemon. tcd can only manage the traffic control settings of pods local to the Kubernetes node on which it’s running, thus the need for one per node.

On the right we see the Weave Scope app showing details for the selected pod; in this case, the one being pointed to by (4). In the red oval, we see the three buttons we’ve added to Scope app for this demo. These set the network connectivity parameters of the selected pod’s egress traffic to a latency of 2000ms, 300ms, 1ms, respectively, from left to right.

When clicked (1), the Scope app sends a message (2) to the Weave Scope probe running on the selected pod’s Kubernetes node. The Weave Scope probe sends a gRPC message (3), in this case a ConfigureEgressMethod message, to the tcd daemon running on its Kubernetes node, telling it to configure the pod’s egress traffic (4) accordingly.

While this demo only configures the latency, tcd can also be used to configure the bandwidth and the percentage of packet drop. As we saw in part 1, those parameters are features directly provided by the Linux netem queuing discipline.
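
For reference, the netem settings tcd applies boil down to tc invocations of roughly this form (the interface name and values are illustrative):

$ tc qdisc replace dev eth0 root netem delay 300ms loss 0.5% rate 1mbit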

By dynamically changing the network characteristics of each pod, we can observe the behaviour of services during transitions as well as in steady state. Of course, by observe we mean test, which we’ll turn to now.

Testing with traffic control

Now for 2 short demos to show how traffic control can be used for testing.

Ping test

This is a contrived demo to show that the setup works and we can, in fact, manipulate the egress traffic characteristics of a pod.

The following video shows a pod downloading a small file from the Internet with the wget command, the target host being the one for which we are adjusting the packet latency.

It should be easy to see the effects that adjusting the latency has: with greater latency, it takes longer to get a reply.

Guestbook app test

We use the Kubernetes guestbook example for our next, more real-world, demo. Some small modifications have been made to provide user feedback when the reply from the web server takes a long time, showing a “loading…” message. Generally, this type of thing goes untested because, as we mentioned in the introduction, our tests run under favorable networking conditions.

Tools like Selenium and agouti allow for testing web applications in an automated way without manually interacting with a browser. For this demo we’ll be using agouti with its Chrome backend so that we can see the test run.

In the following video we see this feature being automatically tested by a Go script using the Ginkgo testing framework and Gomega matcher library.

In this demo, testers still need to configure the traffic control latency manually by clicking on the Weave Scope app buttons before running the test. However, since tcd can accept commands over gRPC, the Go script could easily connect to tcd to perform that configuration automatically, and dynamically, at run time. We’ll leave that as an exercise for the reader. :)
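As a sketch of that exercise, a test script could call tcd with a generic gRPC client such as grpcurl. Everything below except the ConfigureEgressMethod message name is invented for illustration (the service name, fields, and port):

# Hypothetical invocation; the service name, fields, and port are
# made up, only ConfigureEgressMethod comes from the demo above.
$ grpcurl -plaintext -d '{"pod": "guestbook-1", "delay_ms": 300}' \
      <node-ip>:50051 tcd.TrafficControl/ConfigureEgressMethod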

Conclusion

With Kubernetes becoming a de facto building block of modern container platforms, we now have a basis on which to start integrating, in a standardized way, features that have long gone ignored. We think traffic control for testing, and other creative endeavors, is a good example of this.

If you’re interested in moving this forward, we encourage you to take what we’ve started and run with it. And whether you just want to talk to us about this or you need professional support in your efforts, we’d be happy to talk to you.

Thanks to…

We’d like to thank Ilya & Tom from Weaveworks and Jorge & Ryan from Red Hat for helping us with some technical issues we ran into while setting up this demo. And a special thanks to Diane from the OpenShift project for helping coordinate the effort.

Introducing systemd.conf 2016

The systemd project will be having its 2nd conference—systemd.conf—from Sept. 28th to Oct. 1st, once again at betahaus in Berlin. After the success of last year’s conference, we’re looking forward to having much of the systemd community in Berlin for a second consecutive year. As this year’s event takes place just before LinuxCon Europe, we’re expecting some new faces.

Kinvolk’s involvement

As an active user and contributor to systemd, currently through our work on rkt, we’re interested in promoting systemd and helping provide a place for the systemd community to gather.

Last year, Kinvolk helped with much of the organization. This year, we’re happy to be expanding our involvement to include handling the financial-side of the event.

In general, Kinvolk is willing to help provide support to open source projects that want to hold events in Berlin. Just send us a mail at [email protected].

Don’t fix what isn’t broken

As feedback from last year’s post-conference survey showed, most attendees were pleased with the format. Thus, very little will change this year. The biggest difference is that we’re adding another room to accommodate a few more people and to facilitate impromptu breakout sessions. Other small changes: we’ll have warm lunches instead of sandwiches, and we’ve dropped the speakers’ dinner as we felt it wasn’t in line with the goal of bringing all attendees together.

Workshop day

A new addition to systemd.conf is the workshop day. The audience for systemd.conf 2015 was predominantly systemd contributors and proficient users. This was very much expected and intended.

However, we also want to give people of varying familiarity with the systemd project the chance to learn more from the people who know it best. The workshop day is intended to facilitate this. The call for presentations (CfP) will include a call for workshop sessions. These will be 2- to 3-hour hands-on sessions covering various areas of, or related to, the systemd project. You can consider it a day of systemd training if that helps with getting approval to attend. :)

As we expect a different audience for workshops than for the presentation and hackfest days, we will be issuing separate tickets. Tickets will become available once the call for participation opens.

Get involved!

There are several ways you can help make systemd.conf 2016 a success.

Become a sponsor

These events are only possible with the support of sponsors. In addition to making the event more awesome, your sponsorship allows us to bring more of the community together by sponsoring the attendance of those community members who need financial assistance to attend.

See the systemd.conf 2016 website for how to become a sponsor.

Submitting talk and workshop proposals

systemd.conf is only as good as the people who attend and the content they provide. In a few weeks we’ll be announcing the opening of the CfP. If you, or your organization, are doing interesting things with systemd, we encourage you to submit a proposal. If you want to share your knowledge of systemd with others, please consider submitting a proposal for a workshop session.

We’re excited about what this year’s event will bring and look forward to seeing you at systemd.conf 2016!

Testing Degraded Network Scenarios with rkt

The current state of testing

Testing applications is important. Some even go as far as saying, “If it isn’t tested, it doesn’t work”. While that may have both a degree of truth and untruth to it, the rise of continuous integration (CI) and automated testing have shown that the software industry is taking testing seriously.

However, there is at least one area of testing that is difficult to automate and, thus, hasn’t been adequately incorporated into testing scenarios: poor network connectivity.

The typical testing process has the developer as the first line of defence. Developers usually work within reliable networking conditions. The developers then submit their code to a CI system which also runs tests under good networking conditions. Once the CI system goes green, internal testing is usually done; ship it!

Nowhere in this process were scenarios tested where your application experiences degraded network conditions. If your internal tests don’t cover these scenarios, then it’s your users who’ll be doing the testing. This is far from an ideal situation and goes against the “test early, test often” mantra of CI; a bug will cost you more the later it’s caught.

Three examples

To make this more concrete, let’s look at a few examples where users might notice issues that you, or your testing infrastructure, may not:

  • A web shop: you click on “buy”, and it redirects to a new page but freezes because of a connection issue. The user gets no feedback on whether the javascript code will retry automatically; the user does not know whether she should refresh. That’s a bug. Once fixed, how do you test it? You need to break the connection just before the test script clicks on the “buy” link.
  • A video stream server: the Real-time Transport Protocol (RTP) uses UDP packets. If some packets drop or arrive too late, it’s not a big deal; the video player will display a degraded video because of the missing packets, but the stream will otherwise play just fine. Or will it? How can the developers of a video stream server test a scenario where 3% of packets are dropped or delayed?
  • Consensus-based applications: applications like etcd or ZooKeeper implement a consensus protocol. They should be designed to handle a node disconnecting from the network and network splits. See the approach CoreOS takes for an example.

It doesn’t take much imagination to come up with more, but these should be enough to make the point.

Where Linux can help

What functionality does the Linux kernel provide to enable us to test these scenarios?

Linux provides a means to shape both the egress traffic (emitted by a network interface) and, to some extent, the ingress traffic (received by a network interface). This is done by way of qdiscs, short for queuing disciplines. In essence, a qdisc is a packet scheduler. Using different qdiscs, we can change the way packets are scheduled. qdiscs can have associated classes and filters. These all combine to let us delay, drop, or rate-limit packets, among a host of other things. A complete description is out of the scope of this blog post.

For our purposes, we’ll just look at one qdisc called “netem”, short for network emulation. This will allow us to tweak the packet scheduling characteristics we want.
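To give a concrete taste of netem before we go further, adding and then removing an artificial 3% packet loss plus 50ms delay on an interface looks like this (eth0 and the values are just examples):

$ sudo tc qdisc add dev eth0 root netem loss 3% delay 50ms
$ sudo tc qdisc del dev eth0 root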

What about containers?

Up to this point we haven’t even mentioned containers. That’s because the story is the same with regard to traffic control whether we’re talking about bare-metal servers, VMs, or containers. Each container resides in its own network namespace, which provides the container with a completely isolated network. Thus, the traffic between containers, or between a container and the host, can all be shaped in the same way.
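For example, from the host we can run the same tc commands inside a container’s network namespace; the PID below is a placeholder for a process running in the container:

$ sudo nsenter --net=/proc/12345/ns/net tc qdisc show dev eth0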

Testing the idea

As a demonstration, I’ve created a simple demo that starts an RTP server in a container using rkt. In order to easily tweak network parameters, I’ve hacked up a GUI written in Gtk/Javascript. And finally, to see the results, we just need to point a video player at our RTP server.

We’ll step through the demo below. But if you want to play along at home, you can find the code in the kinvolk/demo repo on GitHub.

Running the demo

First, I start the video streaming server in a rkt pod. The server streams the Elephants Dream movie to a media player via the RTP/RTSP protocols. RTSP uses a TCP connection to send commands to the server; examples of commands are choosing the file to play or seeking to a point in the middle of the stream. RTP is what actually sends the video, via UDP packets.

Second, we start the GUI to dynamically change some parameters of the network emulator. What this does is connect to the rkt pod’s network namespace and change the egress qdisc using Linux’s tc command.

Now we can adjust the values as we like. For example, when I add 5% packet loss, the quality is degraded but not interrupted. When I remove the packet loss, the video becomes clear again. When I add 10s latency in the network, the video freezes. Play the video to see this in action.

What this shows us is that traffic control can be used effectively with containers to test applications - in this case a media server.

Next steps

The drawback to this approach is that it’s still manual. For automated testing we don’t want a GUI. Rather, we need a means of scripting various scenarios.

In rkt we use CNI network plugins to configure the network. Interestingly, several plugins can be used together to define several network interfaces. What I’d like to see is a new plugin that allows one to configure traffic control in the network namespace of the container.

In order to integrate this into testing frameworks, the traffic control parameters should be dynamically adjustable, allowing for the scriptability mentioned above.

Stay tuned…

In a coming blog post, we’ll show that this is not only interesting when using rkt as an isolated component. It’s more interesting when tested in a container orchestration system like Kubernetes.

Follow Kinvolk on twitter to get notified when new blog posts go live.

Welcome rkt 1.0!

About 14 months ago, CoreOS announced their intention to build a new container runtime based on the App Container Specification, introduced at the same time. Over these past 14 months, the rkt team has worked to make rkt viable for production use and get to a point where we could offer certain stability guarantees. With today’s release of rkt 1.0, the rkt team believes we have reached that point.

We’d like to congratulate CoreOS on making it to this milestone and look forward to seeing rkt mature. With rkt, CoreOS has provided the community with a container runtime with first-class integration on modern Linux systems and a security-first approach.

We’d especially like to thank CoreOS for giving us the chance to be involved with rkt. Over the past months we’ve had the pleasure to make substantial contributions to rkt. Now that the 1.0 release is out, we look forward to continuing that, with even greater input from and collaboration with the community.

At Kinvolk, we want to push Linux forward by contributing to projects that are at the core of modern Linux systems. We believe that rkt is one of these technologies. We are especially happy that we could work to make the integration with systemd as seamless as possible. There’s still work to do on this front, but we’re happy with where we’ve gotten so far.

rkt is so important because it fills a hole that was left by other container runtimes. It lets the operating system do what it does best: manage processes. We agree whole-heartedly with Lennart, creator and lead developer of the systemd project, when he states…

I believe in the rkt model. Integrating container and service management, so that there's a 1:1 mapping between containers and host services is an excellent idea. Resource management, introspection, life-cycle management of containers and services -- all that tightly integrated with the OS, that's how a container manager should be designed.

Lennart Poettering

Over the next few weeks, we’ll be posting a series of blog stories related to rkt. Follow Kinvolk on twitter to get notified when they go live and follow the story.

FOSDEM 2016 Wrap-up: Bowling with Containers

Another year, another trip to FOSDEM, arguably the best free & open source software event in the world, but definitely the best in Europe. FOSDEM offers an amazingly broad range of talks which is only surpassed by the richness of its hallway track… and maybe the legendary beer event. ;)

This year our focus was to talk to folks about rkt, the container runtime we work on with CoreOS, and meet more people from the container development community, along with the usual catching up with old friends.

On Saturday, Alban gave a talk with CoreOS’ Jon Boulle entitled “Container mechanics in rkt and Linux”, where Jon presented a general overview of the rkt project and Alban followed with a deep dive into how containers work on Linux, and in rkt specifically. The talk was very well attended. If you weren’t able to attend, however, you can find the slides here.

For Saturday evening, we had organized a bowling event for some of the people involved in rkt, and in containers on Linux in general. A majority of the people attending we’d not yet met IRL. We finally got a chance to meet the team from Intel Poland that has been working on rkt’s LKVM stage1, the BlaBlaCar team—brave early adopters of rkt—as well as some folks from NTT and Virtuozzo. There were also a few folks we see quite often from Red Hat, Giant Swarm and, of course, the team from CoreOS. As it turns out, the best bowler was the aforementioned Jon Boulle, who bowled a very respectable score of 120.

Having taken the FOSDEM pilgrimage about 20 times collectively now, the Kinvolk team are veterans of the event. However, each year brings new, exciting topics of discussion. These are mostly shaped by one’s own interests (containers and SDN for us) but also by new trends within the community. We’re already excited to see what next year will bring. We hope to see you there!

Testing systemd Patches

It’s not so easy to test new patches for systemd. Because systemd is the first process started on boot, the traditional way to test was to install the new version on your own computer and reboot. However, this approach is not practical because it makes the development cycle quite long: after writing a few lines of code, I don’t want to close all my applications and reboot. There is also a risk that my patch contains some bugs and if I install systemd on my development computer, it won’t boot. It would then take even more time to fix it. All of this probably just to test a few lines of code.

This is of course not a new problem, and systemd-nspawn was first implemented in 2011 as a simple tool to test systemd in an isolated environment. Over the years, systemd-nspawn grew in features and became more than a testing tool. Today, it is integrated with other components of the systemd project, such as machinectl, and it can pull container or VM images and start them as systemd units. systemd-nspawn is also used as an internal component of the app container runtime, rkt.

When developing rkt, I often need to test patches in systemd-nspawn or in other components of the systemd project, like systemd-machined. And since systemd-nspawn uses recent features of the Linux kernel that are still being developed (cgroups, user namespaces, etc.), I also sometimes need to test a different kernel or a different systemd-machined. In those cases, testing with systemd-nspawn alone does not help, because I would still be using the kernel and the systemd-machined installed on my computer.

I still don’t want to reboot, nor do I want to install a non-stable kernel or non-stable systemd patches on my development computer. So today I am explaining how I test new kernels and new systemd versions with kvmtool and debootstrap.

Getting kvmtool

Why kvmtool? I want to be able to install systemd into my test environment with just a “make install”. kvmtool can boot a VM directly from a directory on the host (shared with the guest via 9p), so I don’t have to prepare a testing image for each test; I can just keep using the same filesystem tree.

$ cd ~/git
$ git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/will/kvmtool
$ cd kvmtool && make

Compiling a kernel

The kernel is compiled as usual but with the options listed in kvmtool’s README file (here’s the .config file I use). I just keep around the different versions of the kernels I want to test:

$ cd ~/git/linux
$ ls bzImage*
bzImage      bzImage-4.3         bzImage-cgroupns.v5  bzImage-v4.1-rc1-2-g1b852bc
bzImage-4.1  bzImage-4.3.0-rc4+  bzImage-v4.1-rc1     bzImage-v4.3-rc4-15-gf670268
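For reference, producing one of these images goes roughly like this (a sketch; the required config options come from kvmtool’s README and the .config linked above):

$ cd ~/git/linux
$ make olddefconfig
$ make -j$(nproc) bzImage
$ cp arch/x86/boot/bzImage bzImage-$(make -s kernelversion)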

Getting the filesystem for the test environment

The man page of systemd-nspawn explains how to install a minimal Fedora, Debian or Arch distribution in a directory with the dnf, debootstrap or pacstrap commands respectively.

sudo dnf -y --releasever=22 --nogpg --installroot=${HOME}/distro-trees/fedora-22 --disablerepo='*' --enablerepo=fedora install systemd passwd dnf fedora-release vim-minimal
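For a Debian tree, the debootstrap equivalent would look something like this (the suite, mirror, and package list are examples):

sudo debootstrap --include=systemd,dbus jessie ${HOME}/distro-trees/debian-jessie http://deb.debian.org/debian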

Set the root password of your Fedora 22 tree the first time:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 passwd

I don’t have to actually boot it with kvmtool to update the system. systemd-nspawn is enough:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 dnf update

Installing systemd

$ cd ~/git/systemd
$ ./autogen.sh
$ ./configure CFLAGS='-g -O0 -ftrapv' --enable-compat-libs --enable-kdbus --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib64
$ make
$ sudo DESTDIR=$HOME/distro-trees/fedora-22 make install
$ sudo DESTDIR=$HOME/distro-trees/fedora-22/fedora-tree make install

As you can see, I am installing systemd both in ~/distro-trees/fedora-22 and in ~/distro-trees/fedora-22/fedora-tree. The first is for the VM started by kvmtool, and the second is for the container started by systemd-nspawn inside the VM.

Running a test

I can easily test my systemd patches with various versions of the kernel and various Linux distributions. I can also start systemd-nspawn inside lkvm if I want to test the interaction between systemd, systemd-machined and systemd-nspawn. All of this without rebooting or installing any unstable software on my main computer.

I am sourcing the following in my shell:

test_kvm() {
        distro=$1
        kernelver=$2
        kernelparams=$3

        kernelimg=${HOME}/git/linux/bzImage-${kernelver}
        distrodir=${HOME}/distro-trees/${distro}

        if [ ! -f $kernelimg -o ! -d $distrodir ] ; then
                echo "Usage: test_kvm distro kernelver kernelparams"
                echo "       test_kvm f22 4.3 systemd.unified_cgroup_hierarchy=1"
                return 1
        fi

        sudo ${HOME}/git/kvmtool/lkvm run --name ${distro}-${kernelver} \
                --kernel ${kernelimg} \
                --disk ${distrodir} \
                --mem 2048 \
                --network virtio \
                --params="${kernelparams}"
}

Then, I can just test rkt or systemd-nspawn with the unified cgroup hierarchy:

$ test_kvm fedora-22 4.3 systemd.unified_cgroup_hierarchy=1

Conclusion

With this setup, I was able to test cgroup namespaces in systemd-nspawn, using kernel patches that are still being reviewed upstream together with my own systemd patches, without rebooting or installing any of them on my development computer.