
Using custom rkt stage1 images to test against various kernel versions

Introduction

When writing software that is tightly coupled with the Linux kernel, it is necessary to test on multiple versions of the kernel. This is relatively easy to do locally with VMs, but when working on open-source code hosted on GitHub, one wants to make these tests part of the project’s continuous integration (CI) system to ensure that each pull request runs the tests and passes before merging.

Most CI systems run tests inside containers and, very sensibly, use various security mechanisms to restrict what the code being tested can access. While this does not cause problems for most use cases, it does for us. It blocks certain syscalls that are needed to, say, test a container runtime like rkt, or to load eBPF programs into the kernel, as we need to do to test gobpf and tcptracer-bpf. It also doesn’t allow us to run virtual machines, which we need in order to run tests on different versions of the kernel.

Finding a continuous integration service

While working on the rkt project, we did a survey of CI systems to find the ones that we could use to test rkt itself. Because of the above-stated requirements, it was clear that we needed one that gave us the option to run tests inside a virtual machine. This makes the list rather small; in fact, we were left with only SemaphoreCI.

SemaphoreCI supports running Docker inside of the test environment. This is possible because the test environment they provide for this is simply a VM. For rkt, this allowed us to run automatic tests for the container runtime each time a PR was submitted and/or changed.

However, it doesn’t solve the problem of testing on various kernels and kernel configurations as we want for gobpf and tcptracer-bpf. Luckily, this is where rkt and its KVM stage1 come to the rescue.

Our solution

To continuously test the work we are doing on Weave Scope, tcptracer-bpf and gobpf, we not only need a relatively new Linux kernel, but also require kernel features like CONFIG_BPF=y or CONFIG_HAVE_KPROBES=y to be enabled.

With rkt’s KVM stage1 we can run our software in a virtual machine and, thanks to rkt’s modular architecture, build and use a custom stage1 suited to our needs. This allows us to run our tests on any platform that allows rkt to run; in our case, Semaphore CI.

Building a custom rkt stage1 for KVM

Our current approach relies on App Container Image (ACI) dependencies. All of our custom stage1 images are based on rkt’s coreos.com/rkt/stage1-kvm. In this way, we can apply changes to particular components (e.g. the Linux kernel) while reusing the other parts of the upstream stage1 image.

An ACI manifest template for such an image could look like the following.

{
        "acKind": "ImageManifest",
        "acVersion": "0.8.9",
        "name": "kinvolk.io/rkt/stage1-kvm-linux-{{kernel_version}}",
        "labels": [
                {
                        "name": "arch",
                        "value": "amd64"
                },
                {
                        "name": "os",
                        "value": "linux"
                },
                {
                        "name": "version",
                        "value": "0.1.0"
                }
        ],
        "annotations": [
                {
                        "name": "coreos.com/rkt/stage1/run",
                        "value": "/init"
                },
                {
                        "name": "coreos.com/rkt/stage1/enter",
                        "value": "/enter_kvm"
                },
                {
                        "name": "coreos.com/rkt/stage1/gc",
                        "value": "/gc"
                },
                {
                        "name": "coreos.com/rkt/stage1/stop",
                        "value": "/stop_kvm"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/add",
                        "value": "/app-add"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/rm",
                        "value": "/app-rm"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/start",
                        "value": "/app-start"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/stop",
                        "value": "/app-stop"
                },
                {
                        "name": "coreos.com/rkt/stage1/interface-version",
                        "value": "5"
                }
        ],
        "dependencies": [
                {
                        "imageName": "coreos.com/rkt/stage1-kvm",
                        "labels": [
                                {
                                        "name": "os",
                                        "value": "linux"
                                },
                                {
                                        "name": "arch",
                                        "value": "amd64"
                                },
                                {
                                        "name": "version",
                                        "value": "1.23.0"
                                }
                        ]
                }
        ]
}

Note: rkt doesn’t automatically fetch stage1 dependencies, so we have to pre-fetch them manually.
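For example, the upstream KVM stage1 that our manifest depends on can be fetched ahead of time like this (the same command shows up again in the CI script later in this post):

rkt image fetch --insecure-options=image coreos.com/rkt/stage1-kvm:1.23.0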

To build a kernel (arch/x86/boot/bzImage), we use make bzImage after applying a single patch to the source tree. Without the patch, the kernel would block and not return control to rkt.

# change directory to kernel source tree
curl -LsS https://raw.githubusercontent.com/coreos/rkt/v1.23.0/stage1/usr_from_kvm/kernel/patches/0001-reboot.patch -O
patch --silent -p1 < 0001-reboot.patch
# configure kernel
make bzImage

We can now combine the ACI manifest with a root filesystem holding our custom-built kernel, for example:

aci/4.9.4/
├── manifest
└── rootfs
    └── bzImage

We are now ready to build the stage1 ACI with actool:

actool build --overwrite aci/4.9.4 my-custom-stage1-kvm.aci

Run rkt with a custom stage1 for KVM

rkt offers multiple command-line flags for providing a stage1; we use --stage1-path=. To smoke test our newly built stage1, we run a Debian Docker container and call uname -r to make sure our custom-built kernel is actually used:

rkt image fetch coreos.com/rkt/stage1-kvm:1.23.0 # due to rkt issue #2241
rkt run \
  --insecure-options=image \
  --stage1-path=./my-custom-stage1-kvm.aci \
  docker://debian --exec=/bin/uname -- -r
4.9.4-kinvolk-v1
[...]

We set CONFIG_LOCALVERSION="-kinvolk-v1" in the kernel config and the version is correctly shown as 4.9.4-kinvolk-v1.
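For reference, the corresponding part of the kernel .config would look roughly like this; CONFIG_LOCALVERSION, CONFIG_BPF and CONFIG_HAVE_KPROBES come from the text above, while CONFIG_BPF_SYSCALL is our addition, needed for the bpf() syscall used later in this series:

CONFIG_LOCALVERSION="-kinvolk-v1"
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_HAVE_KPROBES=y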

Run on Semaphore CI

Semaphore does not include rkt by default on its platform. Hence, we have to download rkt in semaphore.sh as a first step:

#!/bin/bash

readonly rkt_version="1.23.0"

if [[ ! -f "./rkt/rkt" ]] || \
  [[ ! "$(./rkt/rkt version | awk '/rkt Version/{print $3}')" == "${rkt_version}" ]]; then

  curl -LsS "https://github.com/coreos/rkt/releases/download/v${rkt_version}/rkt-v${rkt_version}.tar.gz" \
    -o rkt.tgz

  mkdir -p rkt
  tar -xvf rkt.tgz -C rkt --strip-components=1
fi

[...]

After that we can pre-fetch the stage1 image we depend on and then run our tests. Note that we now use ./rkt/rkt, and that we wrap the invocation in timeout to make sure our tests fail if they do not finish in a reasonable amount of time.

Example:

sudo ./rkt/rkt image fetch --insecure-options=image coreos.com/rkt/stage1-kvm:1.23.0
sudo timeout --foreground --kill-after=10 5m \
  ./rkt/rkt \
  --uuid-file-save=./rkt-uuid \
  --insecure-options=image,all-run \
  --stage1-path=./rkt/my-custom-stage1-kvm.aci \
  ...
  --exec=/bin/sh -- -c \
  'cd /go/... ; \
    go test -v ./...'

--uuid-file-save=./rkt-uuid is required so that semaphore.sh can determine the UUID of the started container, read its exit status (which is not propagated by the KVM stage1) after the test has finished, and exit accordingly:

[...]

test_status=$(sudo ./rkt/rkt status $(<rkt-uuid) | awk '/app-/{split($0,a,"=")} END{print a[2]}')
exit $test_status

Bind mount directories from stage1 into stage2

If you want to provide data to stage2 from stage1, you can do this with a small systemd drop-in unit that bind mounts the directories. This allows you to add or modify content without actually touching the stage2 root filesystem.

We did the following to provide the Linux kernel headers to stage2:

# add systemd drop-in to bind mount kernel headers
mkdir -p "${rootfs_dir}/etc/systemd/system/prepare-app@.service.d"
cat <<EOF >"${rootfs_dir}/etc/systemd/system/prepare-app@.service.d/10-bind-mount-kernel-header.conf"
[Service]
ExecStartPost=/usr/bin/mkdir -p %I/${kernel_header_dir}
ExecStartPost=/usr/bin/mount --bind "${kernel_header_dir}" %I/${kernel_header_dir}
EOF

Note: for this to work, you need to have mkdir in stage1, which is not included in the default rkt stage1-kvm. We use the one from busybox: https://busybox.net/downloads/binaries/1.26.2-i686/busybox_MKDIR
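For illustration, a sketch of how that busybox mkdir could be put in place while assembling the stage1 rootfs, assuming the rootfs is being built in ${rootfs_dir} as in the snippet above:

# download busybox's statically linked mkdir and install it where the
# drop-in above expects it (/usr/bin/mkdir inside stage1)
curl -LsS https://busybox.net/downloads/binaries/1.26.2-i686/busybox_MKDIR \
  -o "${rootfs_dir}/usr/bin/mkdir"
chmod +x "${rootfs_dir}/usr/bin/mkdir"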

Automating the steps

We want to be able to do this for many kernel versions. Thus, we have created a tool, stage1-builder, that does most of this for us. With stage1-builder you simply need to add the kernel configuration to the config directory and run the ./builder script, as sketched below. The result is an ACI file containing our custom kernel with a dependency on the upstream KVM stage1.
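A typical session might look like this (the repository URL and config layout are our best-guess illustrations; see the stage1-builder README for the authoritative usage):

git clone https://github.com/kinvolk/stage1-builder
cd stage1-builder
# add or adjust a kernel config under config/, then:
./builder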

Conclusion

With SemaphoreCI providing us with a proper VM and rkt’s modular stage1 architecture, we have put together a CI pipeline that allows us to test gobpf and tcptracer-bpf on various kernels. In our opinion, this setup is far preferable to the alternative: setting up and maintaining Jenkins.

It is worth pointing out that we did not have to use or make changes to rkt’s build system. Leveraging ACI dependencies was all we needed to swap out the KVM stage1 kernel. For the simple case of testing software on various kernel versions, rkt’s modular design has proven to be very useful.

Kinvolk Presenting at FOSDEM 2017

The same procedure as last year, Miss Sophie?
The same procedure as every year, James!

As with every year, we’ve reserved the first weekend of February to attend FOSDEM, the premier open-source conference in Europe. We’re looking forward to having drinks and chatting with other open-source contributors and enthusiasts.

But it’s not all fun and games for us. The Kinvolk team has three talks: one each in the Go devroom, the Testing and Automation devroom, and the Linux Containers and Microservices devroom.

The talks

We look forward to sharing our work and having conversations about the following topics…

If you’ll be there and are interested in those, or other projects we work on, please do track us down.

We look forward to seeing you there!

Introducing gobpf - Using eBPF from Go

What is eBPF?

eBPF is a “bytecode virtual machine” in the Linux kernel that is used for tracing kernel functions, networking, performance analysis and more. Its roots lie in the Berkeley Packet Filter (sometimes called LSF, Linux Socket Filtering), but as it supports more operations (e.g. BPF_CALL 0x80 /* eBPF only: function call */) and nowadays has much broader use than packet filtering on a socket, it’s called extended BPF.

With the addition of the dedicated bpf() syscall in Linux 3.18, it became easier to perform the various eBPF operations. Further, the BPF compiler collection from the IO Visor Project and its libbpf provide a rich set of helper functions as well as Python bindings that make it more convenient to write eBPF powered tools.

To get an idea of how eBPF looks, let’s take a peek at struct bpf_insn prog[] - a list of instructions in pseudo-assembly. Below we have a simple user-space C program to count the number of fchownat(2) calls. We use bpf_prog_load from libbpf to load the eBPF instructions as a kprobe and use bpf_attach_kprobe to attach it to the syscall. Now each time fchownat is called, the kernel executes the eBPF program. The program loads the map (more about maps later), increments the counter and exits. In the C program, we read the value from the map and print it every second.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <linux/version.h>

#include <bcc/bpf_common.h>
#include <bcc/libbpf.h>

int main() {
	int map_fd, prog_fd, key=0, ret;
	long long value = 0; /* initial counter value */
	char log_buf[8192];
	void *kprobe;

	/* Map size is 1 since we store only one value, the chown count */
	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1);
	if (map_fd < 0) {
		fprintf(stderr, "failed to create map: %s (ret %d)\n", strerror(errno), map_fd);
		return 1;
	}

	ret = bpf_update_elem(map_fd, &key, &value, 0);
	if (ret != 0) {
		fprintf(stderr, "failed to initialize map: %s (ret %d)\n", strerror(errno), ret);
		return 1;
	}

	struct bpf_insn prog[] = {
		/* Put 0 (the map key) on the stack */
		BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
		/* Put frame pointer into R2 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		/* Decrement pointer by four */
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
		/* Put map_fd into R1 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		/* Load current count from map into R0 */
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
			     BPF_FUNC_map_lookup_elem),
		/* If returned value NULL, skip two instructions and return */
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		/* Put 1 into R1 */
		BPF_MOV64_IMM(BPF_REG_1, 1),
		/* Increment value by 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
		/* Return from program */
		BPF_EXIT_INSN(),
	};

	prog_fd = bpf_prog_load(BPF_PROG_TYPE_KPROBE, prog, sizeof(prog), "GPL", LINUX_VERSION_CODE, log_buf, sizeof(log_buf));
	if (prog_fd < 0) {
		fprintf(stderr, "failed to load prog: %s (ret %d)\ngot CAP_SYS_ADMIN?\n%s\n", strerror(errno), prog_fd, log_buf);
		return 1;
	}

	kprobe = bpf_attach_kprobe(prog_fd, "p_sys_fchownat", "p:kprobes/p_sys_fchownat sys_fchownat", -1, 0, -1, NULL, NULL);
	if (kprobe == NULL) {
		fprintf(stderr, "failed to attach kprobe: %s\n", strerror(errno));
		return 1;
	}

	for (;;) {
		ret = bpf_lookup_elem(map_fd, &key, &value);
		if (ret != 0) {
			fprintf(stderr, "failed to lookup element: %s (ret %d)\n", strerror(errno), ret);
		} else {
			printf("fchownat(2) count: %lld\n", value);
		}
		sleep(1);
	}

	return 0;
}

The example requires libbcc and can be compiled with:

gcc -I/usr/include/bcc/compat main.c -o chowncount -lbcc

Nota bene: the increment in the example code is not atomic. In real code, we would have to use one map per CPU and aggregate the result.

It is important to know that eBPF programs run directly in the kernel and that how they are invoked depends on their type. They are executed without a change of context. As we have seen above, kprobes, for example, are triggered whenever the kernel executes a specified function.

Thanks to clang and LLVM, it’s not necessary to actually write plain eBPF instructions. Modules can be written in C and use functions provided by libbpf (as we will see in the gobpf example below).

eBPF Program Types

The type of an eBPF program defines properties like the kernel helper functions available to the program or the input it receives from the kernel. Linux 4.8 knows the following program types:

// https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/bpf.h#L90-L98
enum bpf_prog_type {
	BPF_PROG_TYPE_UNSPEC,
	BPF_PROG_TYPE_SOCKET_FILTER,
	BPF_PROG_TYPE_KPROBE,
	BPF_PROG_TYPE_SCHED_CLS,
	BPF_PROG_TYPE_SCHED_ACT,
	BPF_PROG_TYPE_TRACEPOINT,
	BPF_PROG_TYPE_XDP,
};

A program of type BPF_PROG_TYPE_SOCKET_FILTER, for instance, receives a struct __sk_buff * as its first argument whereas it’s struct pt_regs * for programs of type BPF_PROG_TYPE_KPROBE.

eBPF Maps

Maps are a “generic data structure for storage of different types of data” and can be used to share data between eBPF programs as well as between kernel and userspace. The key and value of a map can be of arbitrary size as defined when creating the map. The user also defines the maximum number of entries (max_entries). Linux 4.8 knows the following map types:

// https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/bpf.h#L78-L88
enum bpf_map_type {
	BPF_MAP_TYPE_UNSPEC,
	BPF_MAP_TYPE_HASH,
	BPF_MAP_TYPE_ARRAY,
	BPF_MAP_TYPE_PROG_ARRAY,
	BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	BPF_MAP_TYPE_PERCPU_HASH,
	BPF_MAP_TYPE_PERCPU_ARRAY,
	BPF_MAP_TYPE_STACK_TRACE,
	BPF_MAP_TYPE_CGROUP_ARRAY,
};

While BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY are generic maps for different types of data, BPF_MAP_TYPE_PROG_ARRAY is a special-purpose array map. It holds file descriptors referring to other eBPF programs and can be used by an eBPF program to “replace its own program flow with the one from the program at the given program array slot”. The BPF_MAP_TYPE_PERF_EVENT_ARRAY map is for storing data of type struct perf_event in a ring buffer.

In the example above we used a map of type hash with a size of 1 to hold the call counter.

gobpf

In the context of the work we are doing on Weave Scope for Weaveworks, we have been working extensively with both eBPF and Go. As Scope is written in Go, it makes sense to use eBPF directly from Go.

In looking at how to do this, we stumbled upon some code in the IO Visor Project that looked like a good starting point. After talking to the folks at the project, we decided to move this out into a dedicated repository: https://github.com/iovisor/gobpf. gobpf is a Go library that leverages the bcc project to make working with eBPF programs from Go simple.

To get an idea of how this works, the following example, chrootsnoop, shows how to use a bpf.PerfMap to monitor chroot(2) calls:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"
	"os/signal"
	"unsafe"

	"github.com/iovisor/gobpf"
)

import "C"

const source string = `
#include <uapi/linux/ptrace.h>
#include <bcc/proto.h>

typedef struct {
	u32 pid;
	char comm[128];
	char filename[128];
} chroot_event_t;

BPF_PERF_OUTPUT(chroot_events);

int kprobe__sys_chroot(struct pt_regs *ctx, const char *filename)
{
	u64 pid = bpf_get_current_pid_tgid();
	chroot_event_t event = {
		.pid = pid >> 32,
	};
	bpf_get_current_comm(&event.comm, sizeof(event.comm));
	bpf_probe_read(&event.filename, sizeof(event.filename), (void *)filename);
	chroot_events.perf_submit(ctx, &event, sizeof(event));
	return 0;
}
`

type chrootEvent struct {
	Pid      uint32
	Comm     [128]byte
	Filename [128]byte
}

func main() {
	m := bpf.NewBpfModule(source, []string{})
	defer m.Close()

	chrootKprobe, err := m.LoadKprobe("kprobe__sys_chroot")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to load kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	err = m.AttachKprobe("sys_chroot", chrootKprobe)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to attach kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	chrootEventsTable := bpf.NewBpfTable(0, m)

	chrootEventsChannel := make(chan []byte)

	chrootPerfMap, err := bpf.InitPerfMap(chrootEventsTable, chrootEventsChannel)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to init perf map: %s\n", err)
		os.Exit(1)
	}

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, os.Kill)

	go func() {
		var chrootE chrootEvent
		for {
			data := <-chrootEventsChannel
			err := binary.Read(bytes.NewBuffer(data), binary.LittleEndian, &chrootE)
			if err != nil {
				fmt.Fprintf(os.Stderr, "Failed to decode received chroot event data: %s\n", err)
				continue
			}
			comm := (*C.char)(unsafe.Pointer(&chrootE.Comm))
			filename := (*C.char)(unsafe.Pointer(&chrootE.Filename))
			fmt.Printf("pid %d %s called chroot(2) on %s\n", chrootE.Pid, C.GoString(comm), C.GoString(filename))
		}
	}()

	chrootPerfMap.Start()
	<-sig
	chrootPerfMap.Stop()
}

You will notice that our eBPF program is written in C for this example. The bcc project uses clang to convert the code to eBPF instructions.

We don’t have to interact with libbpf directly from our Go code, as gobpf implements a callback and makes sure we receive the data from our eBPF program through the chrootEventsChannel.

To test the example, you can run it with sudo -E go run chrootsnoop.go and, for instance, execute any systemd unit with a RootDirectory statement. A simple chroot ... also works, of course.

# hello.service
[Unit]
Description=hello service

[Service]
RootDirectory=/tmp/chroot
ExecStart=/hello

[Install]
WantedBy=default.target

You should see output like:

pid 7857 hello called chroot(2) on /tmp/chroot

Conclusion

With its growing capabilities, eBPF has become an indispensable tool for modern Linux system software. gobpf helps you to conveniently use libbpf functionality from Go.

gobpf is in a very early stage, but usable. Input and contributions are very much welcome.

If you want to learn more about our use of eBPF in software like Weave Scope, stay tuned and have a look at our work on GitHub: https://github.com/kinvolk

Follow Kinvolk on Twitter to get notified when new blog posts go live.

Testing web services with traffic control on Kubernetes

This is part 2 of our “testing applications with traffic control series”. See part 1, testing degraded network scenarios with rkt, for detailed information about how traffic control works on Linux.

In this installment we demonstrate how to test web services with traffic control on Kubernetes. We introduce tcd, a simple traffic control daemon developed by Kinvolk for this demo. Our demonstration system runs on OpenShift 3, Red Hat’s container platform based on Kubernetes, and uses the excellent Weave Scope, an interactive container monitoring and visualization tool.

We’ll be giving a live demonstration of this at the OpenShift Commons Briefing on May 26th, 2016. Please join us there.

The premise

As discussed in part 1 of this series, tests generally run under optimal networking conditions. This means that standard testing procedures neglect a whole bevy of issues that can arise due to poor network conditions.

Would it not be prudent to also test that your services perform satisfactorily when there is, for example, high packet loss, high latency, a slow rate of transmission, or a combination of those? We think so, and if you do too, please read on.

Traffic control on a distributed system

Let’s now make things more concrete by using tcd in our Kubernetes cluster.

The setup

To get started, we need to start an OpenShift-ready VM to provide us with our Kubernetes cluster. We’ll then create an OpenShift project and do some configuration.

If you want to follow along, you can go to our demo repository which will guide you through installing and setting up things.

The pieces

Before diving into the traffic control demo, we want to give you a really quick overview of tcd, OpenShift and Weave Scope.

tcd (traffic control daemon)

tcd is a simple daemon that runs on each Kubernetes node and responds to API calls. tcd manipulates the traffic control settings of the pods using the tc command, which we briefly mentioned in part 1. It’s decoupled from the service being tested, meaning you can stop and restart the daemon on a pod without affecting its connectivity.

In this demo, it receives commands from buttons exposed in Weave Scope.

OpenShift

OpenShift is Red Hat’s container platform that makes it simple to build, deploy, manage and secure containerized applications at scale on any cloud infrastructure, including Red Hat’s own hosted offering, OpenShift Dedicated. Version 3 of OpenShift uses Kubernetes under the hood to maintain cluster health and easily scale services.

In the following figure, you see an example of the OpenShift dashboard with the running pods.

Here we have one Weave Scope app pod, three ping test pods, and one tcd pod. Using the arrow buttons, one can scale the application up and down, and the circle changes color depending on the status of the application (e.g. scaling, terminating, etc.).

Weave Scope

Weave Scope helps to intuitively understand, monitor, and control containerized applications. It visually represents pods and processes running on Kubernetes and allows one to drill into pods, showing information such as CPU & memory usage, running processes, etc. One can also stop, start, and interact with containerized applications directly through its UI.

While this graphic shows Weave Scope displaying containers, we see at the top that we can also display information about processes and hosts.

How the pieces fit together

Now that we understand the individual pieces, let’s see how it all works together. Below is a diagram of our demo system.

Here we have 2 Kubernetes nodes each running one instance of the tcd daemon. tcd can only manage the traffic control settings of pods local to the Kubernetes node on which it’s running, thus the need for one per node.

On the right we see the Weave Scope app showing details for the selected pod; in this case, the one being pointed to by (4). In the red oval, we see the three buttons we’ve added to the Scope app for this demo. These set the latency of the selected pod’s egress traffic to 2000ms, 300ms, and 1ms, respectively, from left to right.

When clicked (1), the Scope app sends a message (2) to the Weave Scope probe running on the selected pod’s Kubernetes node. The probe then sends a gRPC message (3), in this case a ConfigureEgressMethod message, to the tcd daemon running on that node, telling it to configure the pod’s egress traffic (4) accordingly.

While this demo only configures the latency, tcd can also be used to configure the bandwidth and the percentage of packet drop. As we saw in part 1, those parameters are features directly provided by the Linux netem queuing discipline.
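To make that concrete, here is a sketch of the kind of command tcd ends up running; entering the namespace via nsenter and the eth0 interface name are illustrative assumptions, not tcd’s actual implementation:

# enter the network namespace of a process in the pod and shape its
# egress traffic: 300ms latency, 3% packet loss, 1mbit/s bandwidth
nsenter --net=/proc/$POD_PID/ns/net \
  tc qdisc replace dev eth0 root netem delay 300ms loss 3% rate 1mbit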

Being able to dynamically change the network characteristics for each pod, we can observe the behaviour of services during transitions as well as in steady state. Of course, by observe we mean test, which we’ll turn to now.

Testing with traffic control

Now for 2 short demos to show how traffic control can be used for testing.

Ping test

This is a contrived demo to show that the setup works and we can, in fact, manipulate the egress traffic characteristics of a pod.

The following video shows a pod downloading a small file from the Internet with the wget command, with the target host being the one for which we are adjusting the packet latency.

It should be easy to see the effect that adjusting the latency has; with greater latency it takes longer to get a reply.

Guestbook app test

We use the Kubernetes guestbook example for our next, more real-world, demo. Some small modifications have been made to provide user feedback when the reply from the web server takes a long time, showing a “loading…” message. Generally, this type of thing goes untested because, as we mentioned in the introduction, our tests run under favorable networking conditions.

Tools like Selenium and agouti allow for testing web applications in an automated way without manually interacting with a browser. For this demo we’ll be using agouti with its Chrome backend so that we can see the test run.

In the following video we see this feature being automatically tested by a Go script using the Ginkgo testing framework and Gomega matcher library.

In this demo, testers still need to configure the traffic control latency manually by clicking on the Weave Scope app buttons before running the test. However, since tcd can accept commands over gRPC, the Go script could easily connect to tcd to perform that configuration automatically, and dynamically, at run time. We’ll leave that as an exercise for the reader. :)

Conclusion

With Kubernetes becoming a de facto building block of modern container platforms, we now have a basis on which to start integrating, in a standardized way, features that have long gone ignored. We think traffic control for testing, and other creative endeavors, is a good example of this.

If you’re interested in moving this forward, we encourage you to take what we’ve started and run with it. And whether you just want to talk to us about this or you need professional support in your efforts, we’d be happy to talk to you.

Thanks to…

We’d like to thank Ilya & Tom from Weaveworks and Jorge & Ryan from Red Hat for helping us with some technical issues we ran into while setting up this demo. And a special thanks to Diane from the OpenShift project for helping coordinate the effort.

Introducing systemd.conf 2016

The systemd project will be having its 2nd conference—systemd.conf—from Sept. 28th to Oct. 1st, once again at betahaus in Berlin. After the success of last year’s conference, we’re looking forward to having much of the systemd community in Berlin for a second consecutive year. As this year’s event takes place just before LinuxCon Europe, we’re expecting some new faces.

Kinvolk’s involvement

As an active user and contributor to systemd, currently through our work on rkt, we’re interested in promoting systemd and helping provide a place for the systemd community to gather.

Last year, Kinvolk helped with much of the organization. This year, we’re happy to be expanding our involvement to include handling the financial side of the event.

In general, Kinvolk is willing to help provide support to open source projects who want to hold events in Berlin. Just send us a mail to [email protected]

Don’t fix what isn’t broken

As feedback from last year’s post-conference survey showed, most attendees were pleased with the format. Thus, this year very little will change. The biggest difference is that we’re adding another room to accommodate a few more people and to facilitate impromptu breakout sessions. Some other small changes are that we’ll have warm lunches instead of sandwiches, and we’ve dropped the speakers’ dinner as we felt it wasn’t in line with the goal of bringing all attendees together.

Workshop day

A new addition to systemd.conf is the workshop day. The audience for systemd.conf 2015 was predominantly systemd contributors and proficient users. This was very much expected and intended.

However, we also want to give people of varying familiarity with the systemd project the chance to learn more from the people who know it best. The workshop day is intended to facilitate this. The call for presentations (CfP) will include a call for workshop sessions. These workshop sessions will be 2 to 3-hour hands-on sessions covering various areas of, or related to, the systemd project. You can consider it a day of systemd training if that helps with getting approval to attend. :)

As we expect a different audience for workshops than for the presentation and hackfest days, we will be issuing separate tickets. Tickets will become available once the call for participation opens.

Get involved!

There are several ways you can help make systemd.conf 2016 a success.

Become a sponsor

These events are only possible with the support of sponsors. In addition to helping the event be more awesome, your sponsorship allows us to bring more of the community together by sponsoring the attendance of those community members who need financial assistance to attend.

See the systemd.conf 2016 website for how to become a sponsor.

Submitting talk and workshop proposals

systemd.conf is only as good as the people who attend and the content they provide. In a few weeks we’ll be announcing the opening of the CfP. If you or your organization are doing interesting things with systemd, we encourage you to submit a proposal. If you want to share your knowledge of systemd with others, please consider submitting a proposal for a workshop session.

We’re excited about what this year’s event will bring and look forward to seeing you at systemd.conf 2016!

Testing Degraded Network Scenarios with rkt

The current state of testing

Testing applications is important. Some even go as far as saying, “If it isn’t tested, it doesn’t work”. While that may have both a degree of truth and untruth to it, the rise of continuous integration (CI) and automated testing have shown that the software industry is taking testing seriously.

However, there is at least one area of testing that is difficult to automate and, thus, hasn’t been adequately incorporated into testing scenarios: poor network connectivity.

The typical testing process has the developer as the first line of defence. Developers usually work within reliable networking conditions. The developers then submit their code to a CI system which also runs tests under good networking conditions. Once the CI system goes green, internal testing is usually done; ship it!

Nowhere in this process were scenarios tested where your application experiences degraded network conditions. If your internal tests don’t cover these scenarios, then it’s your users who’ll be doing the testing. This is far from an ideal situation and goes against the “test early, test often” mantra of CI; a bug will cost you more the later it’s caught.

Three examples

To make this more concrete, let’s look at a few examples where users might notice issues that you, or your testing infrastructure, may not:

  • A web shop: you click on “buy” and it redirects to a new page, but the page freezes because of a connection issue. The user gets no feedback on whether the JavaScript code will retry automatically; the user does not know whether she should refresh. That’s a bug. Once fixed, how do you test it? You need to break the connection just before the test script clicks on the “buy” link.
  • A video stream server: the Real-time Transport Protocol (RTP) uses UDP packets. If some packets drop or arrive too late, it’s not a big deal; the video player will display a degraded video because of the missing packets, but the stream will otherwise play just fine. Or will it? How can the developers of a video stream server test a scenario where 3% of packets are dropped or delayed?
  • Applications like etcd or ZooKeeper implement a consensus protocol. They should be designed to handle a node disconnecting from the network and network splits. See the approach CoreOS takes for an example.

It doesn’t take much imagination to come up with more, but these should be enough to make the point.

Where Linux can help

What functionality does the Linux kernel provide to enable us to test these scenarios?

Linux provides a means to shape both the egress traffic (emitted by a network interface) and, to some extent, the ingress traffic (received by a network interface). This is done by way of qdiscs, short for queuing disciplines. In essence, a qdisc is a packet scheduler. Using different qdiscs we can change the way packets are scheduled. qdiscs can have associated classes and filters. These all combine to let us delay, drop, or rate-limit packets, among a host of other things. A complete description is out of the scope of this blog post.

For our purposes, we’ll just look at one qdisc called “netem”, short for network emulation. This will allow us to tweak the packet scheduling characteristics we want.
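To give a taste of netem, the following commands (run as root, with eth0 as an example interface) add, change and remove an egress qdisc:

# add 300ms of delay with 50ms of jitter, plus 3% packet loss
tc qdisc add dev eth0 root netem delay 300ms 50ms loss 3%
# change it to a plain 10ms delay
tc qdisc change dev eth0 root netem delay 10ms
# remove the qdisc to restore normal behaviour
tc qdisc del dev eth0 root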

What about containers?

Up to this point we haven’t even mentioned containers. That’s because the story is the same with regards to traffic control whether we’re talking about bare-metal servers, VMs or containers. Containers reside in their own network namespace, providing the container with a completely isolated network. Thus, the traffic between containers, or between a container and the host, can all be shaped in the same way.

Testing the idea

As a demonstration I’ve created a simple demo that starts an RTP server in a container using rkt. In order to easily tweak network parameters, I’ve hacked up a GUI written in Gtk/JavaScript. And finally, to see the results we just need to point a video player to our RTP server.

We’ll step through the demo below. But if you want to play along at home, you can find the code in the kinvolk/demo repo on GitHub.

Running the demo

First, I start the video streaming server in a rkt pod. The server streams the Elephants Dream movie to a media player via the RTP/RTSP protocols. RTSP uses a TCP connection to send commands to the server; examples of commands are choosing the file to play or seeking to a point in the middle of the stream. RTP is what actually sends the video, via UDP packets.

Second, we start the GUI to dynamically change some parameters of the network emulator. What this does is connect to the rkt network namespace and change the egress qdisc using Linux’s tc command.

Now we can adjust the values as we like. For example, when I add 5% packet loss, the quality is degraded but not interrupted. When I remove the packet loss, the video becomes clear again. When I add 10s latency in the network, the video freezes. Play the video to see this in action.

What this shows us is that traffic control can be used effectively with containers to test applications - in this case a media server.

Next steps

The drawback to this approach is that it’s still manual. For automated testing we don’t want a GUI. Rather, we need a means of scripting various scenarios.

In rkt we use CNI network plugins to configure the network. Interestingly, several plugins can be used together to define several network interfaces. What I’d like to see added is a plugin that allows one to configure traffic control in the network namespace of the container.

In order to integrate this into testing frameworks, the traffic control parameters should be dynamically adjustable, allowing for the scriptability mentioned above.

Stay tuned…

In a coming blog post, we’ll show that this is not only interesting when using rkt as an isolated component. It’s more interesting when tested in a container orchestration system like Kubernetes.

Follow Kinvolk on Twitter to get notified when new blog posts go live.

Welcome rkt 1.0!

About 14 months ago, CoreOS announced their intention to build a new container runtime based on the App Container Specification, introduced at the same time. Over these past 14 months, the rkt team has worked to make rkt viable for production use and get to a point where we could offer certain stability guarantees. With today’s release of rkt 1.0, the rkt team believes we have reached that point.

We’d like to congratulate CoreOS on making it to this milestone and look forward to seeing rkt mature. With rkt, CoreOS has provided the community with a container runtime with first-class integration on modern Linux systems and a security-first approach.

We’d especially like to thank CoreOS for giving us the chance to be involved with rkt. Over the past months we’ve had the pleasure to make substantial contributions to rkt. Now that the 1.0 release is out, we look forward to continuing that, with even greater input from and collaboration with the community.

At Kinvolk, we want to push Linux forward by contributing to projects that are at the core of modern Linux systems. We believe that rkt is one of these technologies. We are especially happy that we could work to make the integration with systemd as seamless as possible. There’s still work on this front to do but we’re happy with where we’ve gotten so far.

rkt is so important because it fills a hole that was left by other container runtimes. It lets the operating system do what it does best: manage processes. We wholeheartedly agree when Lennart, creator and lead developer of the systemd project, states…

I believe in the rkt model. Integrating container and service management, so that there's a 1:1 mapping between containers and host services is an excellent idea. Resource management, introspection, life-cycle management of containers and services -- all that tightly integrated with the OS, that's how a container manager should be designed.

Lennart Poettering

Over the next few weeks, we’ll be posting a series of blog stories related to rkt. Follow Kinvolk on Twitter to get notified when they go live and follow the story.

FOSDEM 2016 Wrap-up: Bowling with Containers

Another year, another trip to FOSDEM, arguably the best free & open source software event in the world, but definitely the best in Europe. FOSDEM offers an amazingly broad range of talks which is only surpassed by the richness of its hallway track… and maybe the legendary beer event. ;)

This year our focus was to talk to folks about rkt, the container runtime we work on with CoreOS, and meet more people from the container development community, along with the usual catching up with old friends.

On Saturday, Alban gave a talk with CoreOS’ Jon Boulle entitled “Container mechanics in rkt and Linux”, where Jon presented a general overview of the rkt project and Alban followed with a deep dive into how containers work on Linux, and in rkt specifically. The talk was very well attended. If you weren’t able to attend however, you can find the slides here.

For Saturday evening, we had organized a bowling event for some of the people involved in rkt, and containers on Linux in general. A majority of the people attending we’d not yet met IRL. We finally got a chance to meet the team from Intel Poland who has been working on rkt’s LKVM stage1, the BlaBlaCar team—brave early adopters of rkt—as well as some folks from NTT and Virtuozzo. There were also a few folks we see quite often from Red Hat, Giant Swarm and of course the team from CoreOS. As it turns out, the best bowler was the aforementioned Jon Boulle, who bowled a very respectable score of 120.

Having taken the FOSDEM pilgrimage about 20 times collectively now, the Kinvolk team are veterans of the event. However, each year brings new, exciting topics of discussion. These are mostly shaped by one’s own interests (containers and SDN for us) but also by new trends within the community. We’re already excited to see what next year will bring. We hope to see you there!

Testing systemd Patches

It’s not so easy to test new patches for systemd. Because systemd is the first process started on boot, the traditional way to test was to install the new version on your own computer and reboot. However, this approach is not practical because it makes the development cycle quite long: after writing a few lines of code, I don’t want to close all my applications and reboot. There is also a risk that my patch contains some bugs and if I install systemd on my development computer, it won’t boot. It would then take even more time to fix it. All of this probably just to test a few lines of code.

This is of course not a new problem, and systemd-nspawn was first implemented in 2011 as a simple tool to test systemd in an isolated environment. Over the years, systemd-nspawn grew in features and became more than a testing tool. Today, it is integrated with other components of the systemd project, such as machinectl, and it can pull container or VM images and start them as systemd units. systemd-nspawn is also used as an internal component of the app container runtime, rkt.

When developing rkt, I often need to test patches in systemd-nspawn or other components of the systemd project, like systemd-machined. And since systemd-nspawn uses recent features of the Linux kernel that are still being developed (cgroups, user namespaces, etc.), I also sometimes need to test a different kernel or a different machined. In this case, testing with systemd-nspawn alone does not help because I would still be using the kernel and systemd-machined installed on my computer.

I still don’t want to reboot nor do I want to install a non-stable kernel or non-stable systemd patches on my development computer. So today I am explaining how I am testing new kernels and new systemd with kvmtool and debootstrap.

Getting kvmtool

Why kvmtool? I want to be able to install systemd in my test environment easily with just a “make install”. I don’t want to have to prepare a testing image for each test but instead just use the same filesystem.

$ cd ~/git
$ git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/will/kvmtool
$ cd kvmtool && make

Compiling a kernel

The kernel is compiled as usual but with the options listed in kvmtool’s README file (here’s the .config file I use). I just keep around the different versions of the kernels I want to test:

$ cd ~/git/linux
$ ls bzImage*
bzImage      bzImage-4.3         bzImage-cgroupns.v5  bzImage-v4.1-rc1-2-g1b852bc
bzImage-4.1  bzImage-4.3.0-rc4+  bzImage-v4.1-rc1     bzImage-v4.3-rc4-15-gf670268

Getting the filesystem for the test environment

The man page of systemd-nspawn explains how to install a minimal Fedora, Debian or Arch distribution in a directory with the dnf, debootstrap or pacstrap commands respectively.

sudo dnf -y --releasever=22 --nogpg --installroot=${HOME}/distro-trees/fedora-22 --disablerepo='*' --enablerepo=fedora install systemd passwd dnf fedora-release vim-minimal
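For a Debian tree, the debootstrap equivalent looks roughly like this (architecture and suite here are examples; see the systemd-nspawn man page for the authoritative invocation):

sudo debootstrap --arch=amd64 unstable ${HOME}/distro-trees/debian-unstable

The rest of this post sticks with the Fedora 22 tree.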

Set the root password of your Fedora 22 tree the first time, and then you are ready to boot it:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 passwd

I don’t have to actually boot it with kvmtool to update the system. systemd-nspawn is enough:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 dnf update

Installing systemd

$ cd ~/git/systemd
$ ./autogen.sh
$ ./configure CFLAGS='-g -O0 -ftrapv' --enable-compat-libs --enable-kdbus --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib64
$ make
$ sudo DESTDIR=$HOME/distro-trees/fedora-22 make install
$ sudo DESTDIR=$HOME/distro-trees/fedora-22/fedora-tree make install

As you can see, I am installing systemd both in ~/distro-trees/fedora-22 and in ~/distro-trees/fedora-22/fedora-tree. The first one is for the VM started by kvmtool; the second is for the container started by systemd-nspawn inside the VM.

Running a test

I can easily test my systemd patches quickly with various versions of the kernel and various Linux distributions. I can also start systemd-nspawn inside lkvm if I want to test the interaction between systemd, systemd-machined and systemd-nspawn. All of this, without rebooting or installing any unstable software on my main computer.

I am sourcing the following in my shell:

test_kvm() {
        distro=$1
        kernelver=$2
        kernelparams=$3

        kernelimg=${HOME}/git/linux/bzImage-${kernelver}
        distrodir=${HOME}/distro-trees/${distro}

        if [ ! -f $kernelimg -o ! -d $distrodir ] ; then
                echo "Usage: test_kvm distro kernelver kernelparams"
                echo "       test_kvm f22 4.3 systemd.unified_cgroup_hierarchy=1"
                return 1
        fi

        sudo ${HOME}/git/kvmtool/lkvm run --name ${distro}-${kernelver} \
                --kernel ${kernelimg} \
                --disk ${distrodir} \
                --mem 2048 \
                --network virtio \
                --params="${kernelparams}"
}

Then, I can just test rkt or systemd-nspawn with the unified cgroup hierarchy:

$ test_kvm fedora-22 4.3 systemd.unified_cgroup_hierarchy=1

Conclusion

With this setup, I could test cgroup namespaces in systemd-nspawn with the kernel patches that are being reviewed upstream and my systemd patches without rebooting or installing them on my development computer.