
Running Kubernetes on Travis CI with minikube

It is not easy to run Kubernetes on Travis CI, as most methods of setting up a cluster need to create resources on AWS or another cloud provider. Setting up VMs is also not possible, as Travis CI doesn’t allow nested virtualization. This post explains how to use minikube without additional resources, in a few simple steps.

Our use case

As we are currently working with Chef on a project to integrate Habitat with Kubernetes (the Habitat Operator), we needed a way to run the end-to-end tests on every pull request. Locally we use minikube, a tool to set up a local one-node Kubernetes cluster for development, or kube-spawn when we need a multi-node cluster. But for automated CI tests we currently require only a single-node setup. So we decided to use minikube to be able to easily catch failed tests and to debug and reproduce them locally.

Typically minikube requires a virtual machine to set up Kubernetes. One day this tweet was shared in our Slack. It turns out that minikube has a not-so-well-documented way of running Kubernetes with no need for virtualization: it sets up localkube, a single binary for Kubernetes that is executed in a Docker container, and Travis CI already has Docker support. There is a warning against running this locally, but since we only use it on Travis CI, in an ephemeral environment, we concluded that this is an acceptable use case.

The setup

This is what our setup looks like. The following is the example .travis.yml file:

sudo: required

env:
- CHANGE_MINIKUBE_NONE_USER=true

before_script:
- curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/v1.7.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
- curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
- sudo minikube start --vm-driver=none --kubernetes-version=v1.7.0
- minikube update-context
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl get nodes -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1; done

How it works

First, it installs kubectl, which is a requirement of minikube. The need for sudo: required comes from minikube’s start process, which must run as root. With the environment variable CHANGE_MINIKUBE_NONE_USER set, minikube will automatically move the config files to the appropriate place and adjust the permissions accordingly. When using the none driver, the kubectl config and credentials generated will be owned by root and will appear in the root user’s home directory. The none driver then does the heavy lifting of setting up localkube on the host. Then the kubeconfig is updated with minikube update-context. And lastly we wait for Kubernetes to be up and ready.
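From there, the script section can exercise the cluster with ordinary kubectl commands. A minimal sketch follows; the deployment name and image are placeholders, not part of the original setup:

script:
# hypothetical smoke test against the one-node cluster
- kubectl run travis-demo --image=nginx:alpine --port=80
- kubectl get pods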

Examples

This work is already being used in the Habitat Operator. For a simple live example setup, have a look at this repo. If you have any questions feel free to ping me on Twitter @LiliCosic.


Habitat Operator - Running Habitat Services with Kubernetes

For the last few months, we’ve been working with the Habitat team at Chef to make Habitat-packaged applications run well in Kubernetes. The result of this collaboration is the Habitat Operator, a Kubernetes controller used to deploy, configure and manage applications packaged with Habitat inside of Kubernetes. This article will give an overview of that work — particularly the issues to address, solutions to those issues, and future work.

Habitat Overview

For the uninitiated, Habitat is a project designed to address building, deploying and running applications.

Building applications

Applications are built from shell scripts known as “plans” which describe how to build the application, and may optionally include configuration files and lifecycle hooks. From the information in the plan, Habitat can create a package of the application.
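For illustration, a minimal plan.sh might look like the following sketch; the names and callback bodies here are hypothetical, and the full set of plan settings is documented by Habitat:

# plan.sh (hypothetical example)
pkg_name=myapp
pkg_origin=myorigin
pkg_version="0.1.0"
pkg_deps=(core/glibc)
pkg_bin_dirs=(bin)

do_build() {
  make
}

do_install() {
  install -D myapp "${pkg_prefix}/bin/myapp"
}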

Deploying applications

In order to run an application with a container runtime like Docker or rkt, Habitat supports exporting packages to a Docker container image. You can then upload the container image to a registry and use it to deploy applications to a container orchestration system like Kubernetes.
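Assuming a package like the one sketched above, exporting and publishing could look roughly like this; the origin, package and registry names are placeholders:

# export the Habitat package to a Docker image
hab pkg export docker myorigin/myapp
# push it to a registry so Kubernetes can pull it
docker tag myorigin/myapp myregistry/myapp
docker push myregistry/myapp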

Running applications

Applications packaged with Habitat — hereafter referred to simply as applications — support the runtime features described below.

These features are available because all Habitat applications run under a supervisor process called the Supervisor. The Supervisor takes care of restarting, reconfiguring and gracefully terminating services. The Supervisor also allows multiple instances of an application to run, with each Supervisor communicating with the other Supervisors via a gossip protocol. These Supervisors can connect to form a ring and establish Service Groups for sharing configuration data and establishing topologies.
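For intuition, this is roughly how a ring is formed when starting Supervisors by hand, outside of Kubernetes; the package name, group and address are placeholders:

# start the first Supervisor with a service
hab start myorigin/myapp --group prod
# further Supervisors join the ring by peering with an existing one
hab start myorigin/myapp --group prod --peer 10.0.0.1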

Integration with Kubernetes

Many of the features that Habitat provides overlap with features that are provided in Kubernetes. Where there is overlap, the Habitat Operator tries to translate, or defer, to the Kubernetes-native mechanism. One design goal of the Habitat Operator is to allow Kubernetes users to use the Kubernetes CLI without fear that Habitat applications will get out of sync. For example, update strategies are a core feature of Kubernetes and should be handled by Kubernetes.

For the features that do not overlap — such as topologies and application binding — the Habitat Operator ensures that these work within Kubernetes.

Joining the ring

One of the fundamental challenges we faced when conforming Habitat to Kubernetes was forming and joining a ring. Habitat uses the --peer flag, which is passed an IP address of a previously started Supervisor. But in the Kubernetes world this is not possible, as all pods need to be started with exactly the same command line flags. To make this work within Kubernetes, we implemented a new flag in Habitat itself, --peer-watch-file. This flag takes a file which should contain a list of one or more IP addresses of the peers in the Service Group it would like to join. Habitat uses this information to form the ring between the Supervisors. The Habitat Operator implements this using a Kubernetes ConfigMap which is mounted into each pod.
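Conceptually, the flag replaces a fixed peer address with a file that the operator keeps up to date. A sketch of the resulting Supervisor invocation; the path and package name are placeholders:

# instead of a fixed --peer address, watch a ConfigMap-backed file
# containing one peer IP per line
hab start myorigin/myapp --peer-watch-file /habitat-operator/peers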

Initial Configuration

Habitat allows for drawing configuration information from different sources. One of them is a user.toml file, which is used for initial configuration and is not gossiped within the ring. Because there can be sensitive data in configuration files, we use Kubernetes Secrets for all configuration data. The Habitat Operator mounts the configuration files in the place where Habitat expects to find them, and the application automatically picks up this configuration as it normally would. This mechanism will also be reused to support configuration updates in the future.
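Creating such a Secret from an existing user.toml is a one-liner; the Secret name here is a placeholder:

# store the initial configuration in a Kubernetes Secret
kubectl create secret generic user-toml --from-file=user.toml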

Topologies

One such feature is the two different topologies that are supported in Habitat. The standalone topology — the default topology in Habitat — is used for applications that are independent of one another. With the leader/follower topology, the Supervisor handles leader election over the ring before the application starts. For this topology, three or more instances of an application must be running for a successful leader election to take place.

Ring encryption

A security feature of Habitat that we brought into the operator is securing the ring by encrypting all communications across the network.

Application binding

We also added the ability to do runtime binding, meaning that applications form a producer/consumer relationship at application start. The producer exports its configuration and the consumer, through the Supervisor ring, consumes that information. You can learn more about that in the demo video below.

Future plans for Habitat operator

The Habitat Operator is in heavy development and we’re excited about the features that we have planned for the coming months.

Export to Kubernetes

We’ve already started work on an exporter for Kubernetes. This will allow you to export an application you packaged with Habitat to a Docker image, along with a generated manifest file that can be used to deploy directly to Kubernetes.

Dynamic configuration

As mentioned above, we are planning to extend the initial configuration mechanism and use the same logic for configuration updates. This work should be landing in Habitat very soon. With it, configuration changes can be made to Habitat applications without restarting pods. How an application handles configuration updates is defined in the application’s Habitat plan.
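For reference, Habitat already exposes configuration updates to a running Service Group through hab config apply; in this sketch the group name is a placeholder and the incarnation number simply has to increase with every update:

# gossip an updated configuration to the myapp.default Service Group
hab config apply myapp.default $(date +%s) updated.toml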

Further Kubernetes integration and demos

We’re also looking into exporting to Helm charts in the near future. This could allow a large collection of Habitat-packaged applications to be brought to Kubernetes.

Another area to explore is integration between the Habitat Builder and Kubernetes. The ability to automatically recompile applications, export images, and deploy to Kubernetes when dependencies are updated could bring great benefits to Habitat and Kubernetes users alike.

Conclusion

Please take the operator for a spin here; the first release is now available. All you need is an application packaged with Habitat and exported as a Docker image, and that functionality is already in Habitat itself.

Note: The Habitat Operator is compatible with Habitat version 0.36.0 onwards. If you have any questions, feel free to ask in the #kubernetes channel in the Habitat Slack or open an issue on the Habitat Operator.


An update on gobpf - ELF loading, uprobes, more program types

Gophers by Ashley McNamara, Ponies by Deirdré Straughan - CC BY-NC-SA 4.0

Almost a year ago we introduced gobpf, a Go library to load and use eBPF programs from Go applications. Today we would like to give you a quick update on the changes and features added since then (i.e. the highlights of git log --oneline --no-merges --since="November 30th 2016" master).

Load BPF programs from ELF object files

With commit 869e637, gobpf was split into two subpackages (github.com/iovisor/gobpf/bcc and github.com/iovisor/gobpf/elf) and learned to load BPF programs from ELF object files. This allows users to pre-build their programs with clang/LLVM and its BPF backend as an alternative to using the BPF Compiler Collection.

One project where we at Kinvolk used pre-built ELF objects is the TCP tracer that we wrote for Weave Scope. Putting the program into the library allows us to go get and vendor the tracer like any other Go dependency.

Another important result of using the ELF loading mechanism is that the Scope container images are much smaller, as bcc and clang are not included and don’t add to the container image size.

Let’s see how this is done in practice by building a demo program to log open(2) syscalls to the ftrace trace_pipe:

// program.c

#include <linux/kconfig.h>
#include <linux/bpf.h>

#include <uapi/linux/ptrace.h>

// definitions of bpf helper functions we need, as found in
// http://elixir.free-electrons.com/linux/latest/source/samples/bpf/bpf_helpers.h

#define SEC(NAME) __attribute__((section(NAME), used))

#define PT_REGS_PARM1(x) ((x)->di)

static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
        (void *) BPF_FUNC_probe_read;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
        (void *) BPF_FUNC_trace_printk;

#define printt(fmt, ...)                                                   \
        ({                                                                 \
                char ____fmt[] = fmt;                                      \
                bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
        })

// the kprobe

SEC("kprobe/SyS_open")
int kprobe__sys_open(struct pt_regs *ctx)
{
        char filename[256];

        bpf_probe_read(filename, sizeof(filename), (void *)PT_REGS_PARM1(ctx));

        printt("open(%s)\n", filename);

        return 0;
}

char _license[] SEC("license") = "GPL";
// this number will be interpreted by the elf loader
// to set the current running kernel version
__u32 _version SEC("version") = 0xFFFFFFFE;

On a Debian system, the corresponding Makefile could look like this:

# Makefile
# …

uname=$(shell uname -r)

build-elf:
        clang \
                -D__KERNEL__ \
                -O2 -emit-llvm -c program.c \
                -I /lib/modules/$(uname)/source/include \
                -I /lib/modules/$(uname)/source/arch/x86/include \
                -I /lib/modules/$(uname)/build/include \
                -I /lib/modules/$(uname)/build/arch/x86/include/generated \
                -o - | \
                llc -march=bpf -filetype=obj -o program.o

A small Go tool can then be used to load the object file and enable the kprobe with the help of gobpf:

// main.go

package main

import (
        "fmt"
        "os"
        "os/signal"

        "github.com/iovisor/gobpf/elf"
)

func main() {
        module := elf.NewModule("./program.o")
        if err := module.Load(nil); err != nil {
                fmt.Fprintf(os.Stderr, "Failed to load program: %v\n", err)
                os.Exit(1)
        }
        defer func() {
                if err := module.Close(); err != nil {
                        fmt.Fprintf(os.Stderr, "Failed to close program: %v", err)
                }
        }()

        if err := module.EnableKprobe("kprobe/SyS_open", 0); err != nil {
                fmt.Fprintf(os.Stderr, "Failed to enable kprobe: %v\n", err)
                os.Exit(1)
        }

        sig := make(chan os.Signal, 1)
        signal.Notify(sig, os.Interrupt, os.Kill)

        <-sig
}

Now every time a process uses open(2), the kprobe will log a message. Messages written with bpf_trace_printk can be seen in the trace_pipe “live trace”:

sudo cat /sys/kernel/debug/tracing/trace_pipe

With go-bindata it’s possible to bundle the compiled BPF program into the Go binary to build a single fat binary that can be shipped and installed conveniently.
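A sketch of the bundling step; the output file and package name are arbitrary choices:

# embed the compiled BPF object file in a Go source file
go-bindata -pkg main -o bpf_bindata.go program.o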

Trace user-level functions with bcc and uprobes

Louis McCormack contributed support for uprobes in github.com/iovisor/gobpf/bcc, so it is now possible to trace user-level function calls. For example, to trace all readline() function calls from /bin/bash processes, you can run the bash_readline.go demo:

sudo -E go run ./examples/bcc/bash_readline/bash_readline.go

More supported program types for gobpf/elf

gobpf/elf learned to load programs of type TRACEPOINT, SOCKET_FILTER, CGROUP_SOCK and CGROUP_SKB:

Tracepoints

A program of type TRACEPOINT can be attached to any Linux tracepoint. Tracepoints in Linux are “a hook to call a function (probe) that you can provide at runtime.” A list of available tracepoints can be obtained with find /sys/kernel/debug/tracing/events -type d.

Socket filtering

Socket filtering is the mechanism used by tcpdump to retrieve packets matching an expression. With SOCKET_FILTER programs, we can filter data on a socket by attaching them with setsockopt(2).
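As an aside, tcpdump can print the classic BPF instructions it compiles from a filter expression, which gives a feel for what such filter programs look like:

# print the compiled (classic) BPF filter program without capturing
sudo tcpdump -i eth0 -d 'tcp port 80'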

cgroups

CGROUP_SOCK and CGROUP_SKB can be used to load and use programs specific to a cgroup. CGROUP_SOCK programs “run any time a process in the cgroup opens an AF_INET or AF_INET6 socket” and can be used to enable socket modifications. CGROUP_SKB programs are similar to SOCKET_FILTER and are executed for each network packet with the purpose of cgroup specific network filtering and accounting.

Continuous integration

We have set up continuous integration and written about how we use custom rkt stage1 images to test against various kernel versions. At the time of writing, gobpf has elementary tests to verify that programs and their sections can be loaded on kernel versions 4.4, 4.9 and 4.10, but no thorough testing of all functionality and features yet (e.g. perf map polling).

Miscellaneous

Thanks to contributors and clients

In closing, we’d like to thank all those who have contributed to gobpf. We look forward to merging more commits from contributors and seeing how others make use of gobpf.

A special thanks goes to Weaveworks for funding the work from which gobpf was born. Continued contributions have been possible through other clients, for whom we are helping build products (WIP) that leverage gobpf.

Introducing kube-spawn: a tool to create local, multi-node Kubernetes clusters

kube-spawn is a tool to easily start a local, multi-node Kubernetes cluster on a Linux machine. While its original audience was mainly developers of Kubernetes, it’s turned into a tool that is great for just trying Kubernetes out and exploring. This article will give a general introduction to kube-spawn and show how to use it.

Overview

kube-spawn aims to become the easiest means of testing and fiddling with Kubernetes on Linux. We started the project because it is still rather painful to start a multi-node Kubernetes cluster on our development machines. And the tools that do provide this functionality generally do not reflect the environment that Kubernetes will eventually be running on: a full Linux OS.

Running a Kubernetes cluster with kube-spawn

So, without further ado, let’s start our cluster. With one command kube-spawn fetches the Container Linux image, prepares the nodes, and deploys the cluster. Note that you can also do these steps individually with machinectl pull-raw, and the kube-spawn setup and init subcommands. But the up subcommand does this all for us.
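For reference, the individual steps look roughly like the following; the image URL and exact subcommand flags are assumptions and may differ between kube-spawn versions:

# fetch the Container Linux image
sudo machinectl pull-raw --verify=no \
  https://stable.release.core-os.net/amd64-usr/current/coreos_developer_container.bin.bz2
# prepare the nodes, then deploy the cluster
sudo ./kube-spawn setup --nodes=3
sudo ./kube-spawn init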

$ sudo GOPATH=$GOPATH CNI_PATH=$GOPATH/bin ./kube-spawn up --nodes=3

When that command completes, you’ll have a 3-node Kubernetes cluster. You’ll need to wait for the nodes to be ready before it’s useful.

$ export KUBECONFIG=$GOPATH/src/github.com/kinvolk/kube-spawn/.kube-spawn/default/kubeconfig
$ kubectl get nodes
NAME           STATUS    AGE       VERSION
kube-spawn-0   Ready     1m        v1.7.0
kube-spawn-1   Ready     1m        v1.7.0
kube-spawn-2   Ready     1m        v1.7.0

Looks like all the nodes are ready. Let’s move on.

The demo app

In order to test that our cluster is working, we’re going to deploy the microservices demo, Sock Shop, from our friends at Weaveworks. The Sock Shop is a complex microservices app that uses many components commonly found in real-world deployments. It lets us test that everything is working and gives us something more substantial to explore than a hello-world app.

Cloning the demo app

To proceed, you’ll need to clone the microservices-demo repo and navigate to the deploy/kubernetes folder.

$ cd ~/repos
$ git clone https://github.com/microservices-demo/microservices-demo.git sock-shop
$ cd sock-shop/deploy/kubernetes/

Deploying the demo app

Now that we have things in place, let’s deploy. We first need to create the sock-shop namespace that the deployment expects.

$ kubectl create namespace sock-shop
namespace "sock-shop" created

With that, we’ve got all we need to deploy the app:

$ kubectl create -f complete-demo.yaml
deployment "carts-db" created
service "carts-db" created
deployment "carts" created
service "carts" created
deployment "catalogue-db" created
service "catalogue-db" created
deployment "catalogue" created
service "catalogue" created
deployment "front-end" created
service "front-end" created
deployment "orders-db" created
service "orders-db" created
deployment "orders" created
service "orders" created
deployment "payment" created
service "payment" created
deployment "queue-master" created
service "queue-master" created
deployment "rabbitmq" created
service "rabbitmq" created
deployment "shipping" created
service "shipping" created
deployment "user-db" created
service "user-db" created
deployment "user" created
service "user" created

Once that completes, we still need to wait for all the pods to come up.

$ watch kubectl -n sock-shop get pods
NAME                            READY     STATUS    RESTARTS   AGE
carts-2469883122-nd0g1          1/1       Running   0          1m
carts-db-1721187500-392vt       1/1       Running   0          1m
catalogue-4293036822-d79cm      1/1       Running   0          1m
catalogue-db-1846494424-njq7h   1/1       Running   0          1m
front-end-2337481689-v8m2h      1/1       Running   0          1m
orders-733484335-mg0lh          1/1       Running   0          1m
orders-db-3728196820-9v07l      1/1       Running   0          1m
payment-3050936124-rgvjj        1/1       Running   0          1m
queue-master-2067646375-7xx9x   1/1       Running   0          1m
rabbitmq-241640118-8htht        1/1       Running   0          1m
shipping-2463450563-n47k7       1/1       Running   0          1m
user-1574605338-p1djk           1/1       Running   0          1m
user-db-3152184577-c8r1f        1/1       Running   0          1m

Accessing the sock shop

When they’re all ready, we have to find out which port and IP address we use to access the shop. For the port, let’s see which port the front-end service is using.

$ kubectl -n sock-shop get svc
NAME           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
carts          10.110.14.144    <none>        80/TCP         3m
carts-db       10.104.115.89    <none>        27017/TCP      3m
catalogue      10.110.157.8     <none>        80/TCP         3m
catalogue-db   10.99.103.79     <none>        3306/TCP       3m
front-end      10.105.224.192   <nodes>       80:30001/TCP   3m
orders         10.101.177.247   <none>        80/TCP         3m
orders-db      10.109.209.178   <none>        27017/TCP      3m
payment        10.107.53.203    <none>        80/TCP         3m
queue-master   10.111.63.76     <none>        80/TCP         3m
rabbitmq       10.110.136.97    <none>        5672/TCP       3m
shipping       10.96.117.56     <none>        80/TCP         3m
user           10.101.85.39     <none>        80/TCP         3m
user-db        10.107.82.6      <none>        27017/TCP      3m

Here we see that the front-end service is exposed on port 30001 and uses the nodes’ external IPs. This means that we can access the front-end service using any worker node’s IP address on port 30001. machinectl gives us each node’s IP address.

$ machinectl
MACHINE      CLASS     SERVICE        OS     VERSION  ADDRESSES
kube-spawn-0 container systemd-nspawn coreos 1492.1.0 10.22.0.137...
kube-spawn-1 container systemd-nspawn coreos 1492.1.0 10.22.0.138...
kube-spawn-2 container systemd-nspawn coreos 1492.1.0 10.22.0.139...

Remember, the first node is the master node and all the others are worker nodes. So in our case, we can open our browser to 10.22.0.138:30001 or 10.22.0.139:30001 and should be greeted by a shop selling socks.

Stopping the cluster

Once you’re done with your sock purchases, you can stop the cluster.

$ sudo ./kube-spawn stop
2017/08/10 01:58:00 turning off machines [kube-spawn-0 kube-spawn-1 kube-spawn-2]...
2017/08/10 01:58:00 All nodes are stopped.

A guided demo

If you’d like a more guided tour, you’ll find it here.

As mentioned in the video, kube-spawn creates a .kube-spawn directory in the current directory, where you’ll find several files and directories under the default directory. In order not to be constrained by the size of each OS container, we mount each node’s /var/lib/docker directory here. In this way, we can make use of the host’s disk space. Also, we don’t currently have a clean command, so you can run rm -rf .kube-spawn/ if you want to completely clean things up.

Conclusion

We hope you find kube-spawn as useful as we do. For us, it’s the easiest way to test changes to Kubernetes or spin up a cluster to explore Kubernetes.

There are still lots of improvements (some very obvious) that can be made. PRs are very much welcome!

All Systems Go! - The Userspace Linux Conference

At Kinvolk we spend a lot of time working on and talking about the Linux userspace. We can regularly be found presenting our work at various events and discussing the details of our work with those who are interested. These events are usually either very generally about open source, or focused on a very specific technology, like containers, systemd, or eBPF. While these events are often awesome, and absolutely essential, they simply have a focus that is either too broad or too specific.

What we felt was missing was an event focused on the Linux userspace itself, and less on the projects and products built on top, or the kernel below. This is the focus of All Systems Go! and why we are excited to be a part of it.

All Systems Go! is designed to be a community event. Tickets to All Systems Go! are affordable — starting at less than 30 EUR — and the event takes place during the weekend, making it more accessible to hobbyists and students. It’s also conveniently scheduled to fall between DockerCon EU in Copenhagen and Open Source Summit in Prague.

Speakers

To make All Systems Go! work, we’ve got to make sure that we get the people to attend who are working at this layer of the system, and on the individual projects that make up the Linux userspace. As a start, we’ve invited a first round of speakers, who also happen to be the CFP selection committee. We’re very happy to welcome to the All Systems Go! team…

While we’re happy to have this initial group of speakers, what’s really going to make All Systems Go! awesome are all the others in the community who submit their proposals and offer their perspectives and voices.

Sponsorship

Sponsors are crucial to open source community events. All Systems Go! is no different. In fact, sponsors are essential to keeping All Systems Go! an affordable and accessible event.

We will soon be announcing our first round of sponsors. If your organization would like to be amongst that group please have a look at our sponsorship prospectus and get in touch.

See you there!

We’re looking forward to welcoming the Linux userspace community to Berlin. Hope to see you there!

Using custom rkt stage1 images to test against various kernel versions

Introduction

When writing software that is tightly coupled with the Linux kernel, it is necessary to test on multiple versions of the kernel. This is relatively easy to do locally with VMs, but when working on open-source code hosted on GitHub, one wants to make these tests a part of the project’s continuous integration (CI) system to ensure that each pull request runs the tests and passes before merging.

Most CI systems run tests inside containers and, very sensibly, use various security mechanisms to restrict what the code being tested can access. While this does not cause problems for most use cases, it does for us. It blocks certain syscalls that are needed to, say, test a container runtime like rkt, or to load eBPF programs into the kernel for testing, as we need to do for gobpf and tcptracer-bpf. It also doesn’t allow us to run virtual machines, which we need in order to run tests on different versions of the kernel.

Finding a continuous integration service

While working on the rkt project, we did a survey of CI systems to find the ones that we could use to test rkt itself. Because of the above-stated requirements, it was clear that we needed one that gave us the option to run tests inside a virtual machine. This makes the list rather small; in fact, we were left with only SemaphoreCI.

SemaphoreCI supports running Docker inside of the test environment. This is possible because the test environment they provide for this is simply a VM. For rkt, this allowed us to run automatic tests for the container runtime each time a PR was submitted and/or changed.

However, it doesn’t solve the problem of testing on various kernels and kernel configurations as we want for gobpf and tcptracer-bpf. Luckily, this is where rkt and its KVM stage1 come to the rescue.

Our solution

To continuously test the work we are doing on Weave Scope, tcptracer-bpf and gobpf, we not only need a relatively new Linux kernel, but also require a subset of features like CONFIG_BPF=y or CONFIG_HAVE_KPROBES=y to be enabled.

With rkt’s KVM stage1 we can run our software in a virtual machine and, thanks to rkt’s modular architecture, build and use a custom stage1 suited to our needs. This allows us to run our tests on any platform that allows rkt to run; in our case, Semaphore CI.

Building a custom rkt stage1 for KVM

Our current approach relies on App Container Image (ACI) dependencies. All of our custom stage1 images are based on rkt’s coreos.com/rkt/stage1-kvm. In this way, we can apply changes to particular components (e.g. the Linux kernel) while reusing the other parts of the upstream stage1 image.

An ACI manifest template for such an image could look like the following.

{
        "acKind": "ImageManifest",
        "acVersion": "0.8.9",
        "name": "kinvolk.io/rkt/stage1-kvm-linux-{{kernel_version}}",
        "labels": [
                {
                        "name": "arch",
                        "value": "amd64"
                },
                {
                        "name": "os",
                        "value": "linux"
                },
                {
                        "name": "version",
                        "value": "0.1.0"
                }
        ],
        "annotations": [
                {
                        "name": "coreos.com/rkt/stage1/run",
                        "value": "/init"
                },
                {
                        "name": "coreos.com/rkt/stage1/enter",
                        "value": "/enter_kvm"
                },
                {
                        "name": "coreos.com/rkt/stage1/gc",
                        "value": "/gc"
                },
                {
                        "name": "coreos.com/rkt/stage1/stop",
                        "value": "/stop_kvm"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/add",
                        "value": "/app-add"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/rm",
                        "value": "/app-rm"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/start",
                        "value": "/app-start"
                },
                {
                        "name": "coreos.com/rkt/stage1/app/stop",
                        "value": "/app-stop"
                },
                {
                        "name": "coreos.com/rkt/stage1/interface-version",
                        "value": "5"
                }
        ],
        "dependencies": [
                {
                        "imageName": "coreos.com/rkt/stage1-kvm",
                        "labels": [
                                {
                                        "name": "os",
                                        "value": "linux"
                                },
                                {
                                        "name": "arch",
                                        "value": "amd64"
                                },
                                {
                                        "name": "version",
                                        "value": "1.23.0"
                                }
                        ]
                }
        ]
}

Note: rkt doesn’t automatically fetch stage1 dependencies, so we have to pre-fetch them manually.

To build a kernel (arch/x86/boot/bzImage), we use make bzImage after applying a single patch to the source tree. Without the patch, the kernel would block and not return control to rkt.

# change directory to kernel source tree
curl -LsS https://raw.githubusercontent.com/coreos/rkt/v1.23.0/stage1/usr_from_kvm/kernel/patches/0001-reboot.patch -O
patch --silent -p1 < 0001-reboot.patch
# configure kernel
make bzImage

We now can combine the ACI manifest with a root filesystem holding our custom built kernel, for example:

aci/4.9.4/
├── manifest
└── rootfs
    └── bzImage

We are now ready to build the stage1 ACI with actool:

actool build --overwrite aci/4.9.4 my-custom-stage1-kvm.aci

Run rkt with a custom stage1 for KVM

rkt offers multiple command line flags for providing a stage1; we use --stage1-path=. To smoke test our newly built stage1, we run a Debian Docker container and call uname -r to make sure our custom-built kernel is actually used:

rkt fetch coreos.com/rkt/stage1-kvm:1.23.0 # due to rkt issue #2241
rkt run \
  --insecure-options=image \
  --stage1-path=./my-custom-stage1-kvm.aci \
  docker://debian --exec=/bin/uname -- -r
4.9.4-kinvolk-v1
[...]

We set CONFIG_LOCALVERSION="-kinvolk-v1" in the kernel config and the version is correctly shown as 4.9.4-kinvolk-v1.

Run on Semaphore CI

Semaphore does not include rkt by default on their platform. Hence, we have to download rkt in semaphore.sh as a first step:

#!/bin/bash

readonly rkt_version="1.23.0"

if [[ ! -f "./rkt/rkt" ]] || \
  [[ ! "$(./rkt/rkt version | awk '/rkt Version/{print $3}')" == "${rkt_version}" ]]; then

  curl -LsS "https://github.com/coreos/rkt/releases/download/v${rkt_version}/rkt-v${rkt_version}.tar.gz" \
    -o rkt.tgz

  mkdir -p rkt
  tar -xvf rkt.tgz -C rkt --strip-components=1
fi

[...]

After that we can pre-fetch the stage1 image we depend on and then run our tests. Note that we now use ./rkt/rkt, and that we use timeout to make sure our tests fail if they don’t finish in a reasonable amount of time.

Example:

sudo ./rkt/rkt image fetch --insecure-options=image coreos.com/rkt/stage1-kvm:1.23.0
sudo timeout --foreground --kill-after=10 5m \
  ./rkt/rkt \
  --uuid-file-save=./rkt-uuid \
  --insecure-options=image,all-run \
  --stage1-path=./rkt/my-custom-stage1-kvm.aci \
  ...
  --exec=/bin/sh -- -c \
  'cd /go/... ; \
    go test -v ./...'

--uuid-file-save=./rkt-uuid is required to determine the UUID of the started container from semaphore.sh, so that after the test has finished we can read its exit status (since it is not propagated by the KVM stage1) and exit accordingly:

[...]

test_status=$(sudo ./rkt/rkt status $(<rkt-uuid) | awk '/app-/{split($0,a,"=")} END{print a[2]}')
exit $test_status

Bind mount directories from stage1 into stage2

If you want to provide data to stage2 from stage1, you can do so with a small systemd drop-in unit that bind mounts the directories. This allows you to add or modify content without actually touching the stage2 root filesystem.

We did the following to provide the Linux kernel headers to stage2:

# add systemd drop-in to bind mount kernel headers
mkdir -p "${rootfs_dir}/etc/systemd/system/[email protected]"
cat <<EOF >"${rootfs_dir}/etc/systemd/system/[email protected]/10-bind-mount-kernel-header.conf"
[Service]
ExecStartPost=/usr/bin/mkdir -p %I/${kernel_header_dir}
ExecStartPost=/usr/bin/mount --bind "${kernel_header_dir}" %I/${kernel_header_dir}
EOF

Note: for this to work you need to have mkdir in stage1, which is not included in the default rkt stage1-kvm. We use the one from busybox: https://busybox.net/downloads/binaries/1.26.2-i686/busybox_MKDIR
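Fetching it into the stage1 rootfs can be done while assembling the ACI; a sketch, reusing the ${rootfs_dir} variable from the snippet above:

# add a static mkdir to stage1 (not shipped with the default stage1-kvm)
curl -LsS https://busybox.net/downloads/binaries/1.26.2-i686/busybox_MKDIR \
  -o "${rootfs_dir}/usr/bin/mkdir"
chmod +x "${rootfs_dir}/usr/bin/mkdir"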

Automating the steps

We want to be able to do this for many kernel versions. Thus, we have created a tool, stage1-builder, that does most of this for us. With stage1-builder you simply need to add the kernel configuration to the config directory and run the ./builder script. The result is an ACI file containing our custom kernel with a dependency on the upstream kvm-stage1.

Conclusion

With SemaphoreCI providing us with a proper VM and rkt’s modular stage1 architecture, we have put together a CI pipeline that allows us to test gobpf and tcptracer-bpf on various kernels. In our opinion this setup is much preferable to the alternative, setting up and maintaining Jenkins.

It is interesting to point out that we did not have to use or make changes to rkt’s build system. Leveraging ACI dependencies was all we needed to swap out the KVM stage1 kernel. For the simple case of testing software on various kernel versions, rkt’s modular design has proven to be very useful.

Kinvolk Presenting at FOSDEM 2017

The same procedure as last year, Miss Sophie?
The same procedure as every year, James!

As with every year, we’ve reserved the first weekend of February to attend FOSDEM, the premier open-source conference in Europe. We’re looking forward to having drinks and chatting with other open-source contributors and enthusiasts.

But it’s not all fun and games for us. The Kinvolk team has three talks; one each in the Go, Testing and Automation, & Linux Containers and Microservices devrooms.

The talks

We look forward to sharing our work and having conversations about the following topics…

If you’ll be there and are interested in those, or other projects we work on, please do track us down.

We look forward to seeing you there!

Introducing gobpf - Using eBPF from Go

Gopher by Takuya Ueda - CC BY 3.0

What is eBPF?

eBPF is a “bytecode virtual machine” in the Linux kernel that is used for tracing kernel functions, networking, performance analysis and more. Its roots lie in the Berkeley Packet Filter (sometimes called LSF, Linux Socket Filtering), but as it supports more operations (e.g. BPF_CALL 0x80 /* eBPF only: function call */) and nowadays has much broader use than packet filtering on a socket, it’s called extended BPF.

With the addition of the dedicated bpf() syscall in Linux 3.18, it became easier to perform the various eBPF operations. Further, the BPF Compiler Collection (bcc) from the IO Visor Project and its libbpf provide a rich set of helper functions as well as Python bindings that make it more convenient to write eBPF-powered tools.

To get an idea of how eBPF looks, let’s take a peek at struct bpf_insn prog[] - a list of instructions in pseudo-assembly. Below we have a simple user-space C program to count the number of fchownat(2) calls. We use bpf_prog_load from libbpf to load the eBPF instructions as a kprobe and use bpf_attach_kprobe to attach it to the syscall. Now each time fchownat is called, the kernel executes the eBPF program. The program loads the map (more about maps later), increments the counter and exits. In the C program, we read the value from the map and print it every second.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <linux/version.h>

#include <bcc/bpf_common.h>
#include <bcc/libbpf.h>

int main() {
	int map_fd, prog_fd, key=0, ret;
	long long value;
	char log_buf[8192];
	void *kprobe;

	/* Map size is 1 since we store only one value, the chown count */
	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1);
	if (map_fd < 0) {
		fprintf(stderr, "failed to create map: %s (ret %d)\n", strerror(errno), map_fd);
		return 1;
	}

	ret = bpf_update_elem(map_fd, &key, &value, 0);
	if (ret != 0) {
		fprintf(stderr, "failed to initialize map: %s (ret %d)\n", strerror(errno), ret);
		return 1;
	}

	struct bpf_insn prog[] = {
		/* Put 0 (the map key) on the stack */
		BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
		/* Put frame pointer into R2 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		/* Decrement pointer by four */
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
		/* Put map_fd into R1 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		/* Load current count from map into R0 */
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
			     BPF_FUNC_map_lookup_elem),
		/* If returned value NULL, skip two instructions and return */
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		/* Put 1 into R1 */
		BPF_MOV64_IMM(BPF_REG_1, 1),
		/* Increment value by 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
		/* Return from program */
		BPF_EXIT_INSN(),
	};

	prog_fd = bpf_prog_load(BPF_PROG_TYPE_KPROBE, prog, sizeof(prog), "GPL", LINUX_VERSION_CODE, log_buf, sizeof(log_buf));
	if (prog_fd < 0) {
		fprintf(stderr, "failed to load prog: %s (ret %d)\ngot CAP_SYS_ADMIN?\n%s\n", strerror(errno), prog_fd, log_buf);
		return 1;
	}

	kprobe = bpf_attach_kprobe(prog_fd, "p_sys_fchownat", "p:kprobes/p_sys_fchownat sys_fchownat", -1, 0, -1, NULL, NULL);
	if (kprobe == NULL) {
		fprintf(stderr, "failed to attach kprobe: %s\n", strerror(errno));
		return 1;
	}

	for (;;) {
		ret = bpf_lookup_elem(map_fd, &key, &value);
		if (ret != 0) {
			fprintf(stderr, "failed to lookup element: %s (ret %d)\n", strerror(errno), ret);
		} else {
			printf("fchownat(2) count: %lld\n", value);
		}
		sleep(1);
	}

	return 0;
}

The example requires libbcc and can be compiled with:

gcc -I/usr/include/bcc/compat main.c -o chowncount -lbcc

Nota bene: the increment in the example code is not atomic. In real code, we would have to use one map per CPU and aggregate the result.

It is important to know that eBPF programs run directly in the kernel and that their invocation depends on their type. They are executed without a change of context. As we have seen above, kprobes, for example, are triggered whenever the kernel executes a specified function.

Thanks to clang and LLVM, it’s not necessary to actually write plain eBPF instructions. Modules can be written in C and use functions provided by libbpf (as we will see in the gobpf example below).

eBPF Program Types

The type of an eBPF program defines properties like the kernel helper functions available to the program or the input it receives from the kernel. Linux 4.8 knows the following program types:

// https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/bpf.h#L90-L98
enum bpf_prog_type {
	BPF_PROG_TYPE_UNSPEC,
	BPF_PROG_TYPE_SOCKET_FILTER,
	BPF_PROG_TYPE_KPROBE,
	BPF_PROG_TYPE_SCHED_CLS,
	BPF_PROG_TYPE_SCHED_ACT,
	BPF_PROG_TYPE_TRACEPOINT,
	BPF_PROG_TYPE_XDP,
};

A program of type BPF_PROG_TYPE_SOCKET_FILTER, for instance, receives a struct __sk_buff * as its first argument whereas it’s struct pt_regs * for programs of type BPF_PROG_TYPE_KPROBE.

eBPF Maps

Maps are a “generic data structure for storage of different types of data” and can be used to share data between eBPF programs as well as between kernel and userspace. The key and value of a map can be of arbitrary size as defined when creating the map. The user also defines the maximum number of entries (max_entries). Linux 4.8 knows the following map types:

// https://github.com/torvalds/linux/blob/v4.8/include/uapi/linux/bpf.h#L78-L88
enum bpf_map_type {
	BPF_MAP_TYPE_UNSPEC,
	BPF_MAP_TYPE_HASH,
	BPF_MAP_TYPE_ARRAY,
	BPF_MAP_TYPE_PROG_ARRAY,
	BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	BPF_MAP_TYPE_PERCPU_HASH,
	BPF_MAP_TYPE_PERCPU_ARRAY,
	BPF_MAP_TYPE_STACK_TRACE,
	BPF_MAP_TYPE_CGROUP_ARRAY,
};

While BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY are generic maps for different types of data, BPF_MAP_TYPE_PROG_ARRAY is a special purpose array map. It holds file descriptors referring to other eBPF programs and can be used by an eBPF program to “replace its own program flow with the one from the program at the given program array slot”. The BPF_MAP_TYPE_PERF_EVENT_ARRAY map is for storing data of type struct perf_event in a ring buffer.

In the example above we used a map of type hash with a size of 1 to hold the call counter.

gobpf

In the context of the work we are doing on Weave Scope for Weaveworks, we have been working extensively with both eBPF and Go. As Scope is written in Go, it makes sense to use eBPF directly from Go.

In looking at how to do this, we stumbled upon some code in the IO Visor Project that looked like a good starting point. After talking to the folks at the project, we decided to move this out into a dedicated repository: https://github.com/iovisor/gobpf. gobpf is a Go library that leverages the bcc project to make working with eBPF programs from Go simple.

To get an idea of how this works, the following example chrootsnoop shows how to use a bpf.PerfMap to monitor chroot(2) calls:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"
	"os/signal"
	"unsafe"

	"github.com/iovisor/gobpf"
)

import "C"

const source string = `
#include <uapi/linux/ptrace.h>
#include <bcc/proto.h>

typedef struct {
	u32 pid;
	char comm[128];
	char filename[128];
} chroot_event_t;

BPF_PERF_OUTPUT(chroot_events);

int kprobe__sys_chroot(struct pt_regs *ctx, const char *filename)
{
	u64 pid = bpf_get_current_pid_tgid();
	chroot_event_t event = {
		.pid = pid >> 32,
	};
	bpf_get_current_comm(&event.comm, sizeof(event.comm));
	bpf_probe_read(&event.filename, sizeof(event.filename), (void *)filename);
	chroot_events.perf_submit(ctx, &event, sizeof(event));
	return 0;
}
`

type chrootEvent struct {
	Pid      uint32
	Comm     [128]byte
	Filename [128]byte
}

func main() {
	m := bpf.NewBpfModule(source, []string{})
	defer m.Close()

	chrootKprobe, err := m.LoadKprobe("kprobe__sys_chroot")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to load kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	err = m.AttachKprobe("sys_chroot", chrootKprobe)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to attach kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	chrootEventsTable := bpf.NewBpfTable(0, m)

	chrootEventsChannel := make(chan []byte)

	chrootPerfMap, err := bpf.InitPerfMap(chrootEventsTable, chrootEventsChannel)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to init perf map: %s\n", err)
		os.Exit(1)
	}

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, os.Kill)

	go func() {
		var chrootE chrootEvent
		for {
			data := <-chrootEventsChannel
			err := binary.Read(bytes.NewBuffer(data), binary.LittleEndian, &chrootE)
			if err != nil {
				fmt.Fprintf(os.Stderr, "Failed to decode received chroot event data: %s\n", err)
				continue
			}
			comm := (*C.char)(unsafe.Pointer(&chrootE.Comm))
			filename := (*C.char)(unsafe.Pointer(&chrootE.Filename))
			fmt.Printf("pid %d %s called chroot(2) on %s\n", chrootE.Pid, C.GoString(comm), C.GoString(filename))
		}
	}()

	chrootPerfMap.Start()
	<-sig
	chrootPerfMap.Stop()
}

You will notice that our eBPF program is written in C for this example. The bcc project uses clang to convert the code to eBPF instructions.

We don’t have to interact with libbpf directly from our Go code, as gobpf implements a callback and makes sure we receive the data from our eBPF program through the chrootEventsChannel.

To test the example, you can run it with sudo -E go run chrootsnoop.go and, for instance, execute any systemd unit with a RootDirectory statement. A simple chroot ... also works, of course.

# hello.service
[Unit]
Description=hello service

[Service]
RootDirectory=/tmp/chroot
ExecStart=/hello

[Install]
WantedBy=default.target

You should see output like:

pid 7857 (hello) called chroot(2) on /tmp/chroot

Conclusion

With its growing capabilities, eBPF has become an indispensable tool for modern Linux system software. gobpf helps you to conveniently use libbpf functionality from Go.

gobpf is in a very early stage, but usable. Input and contributions are very much welcome.

If you want to learn more about our use of eBPF in software like Weave Scope, stay tuned and have a look at our work on GitHub: https://github.com/kinvolk


Testing web services with traffic control on Kubernetes

This is part 2 of our “testing applications with traffic control series”. See part 1, testing degraded network scenarios with rkt, for detailed information about how traffic control works on Linux.

In this installment we demonstrate how to test web services with traffic control on Kubernetes. We introduce tcd, a simple traffic control daemon developed by Kinvolk for this demo. Our demonstration system runs on OpenShift 3, Red Hat’s container platform based on Kubernetes, and uses the excellent Weave Scope, an interactive container monitoring and visualization tool.

We’ll be giving a live demonstration of this at the OpenShift Commons Briefing on May 26th, 2016. Please join us there.

The premise

As discussed in part 1 of this series, tests generally run under optimal networking conditions. This means that standard testing procedures neglect a whole bevy of issues that can arise due to poor network conditions.

Would it not be prudent to also test that your services perform satisfactorily when there is, for example, high packet loss, high latency, a slow rate of transmission, or a combination of those? We think so, and if you do too, please read on.

Traffic control on a distributed system

Let’s now make things more concrete by using tcd in our Kubernetes cluster.

The setup

To get started, we need to start an OpenShift ready VM to provide us our Kubernetes cluster. We’ll then create an OpenShift project and do some configuration.

If you want to follow along, you can go to our demo repository, which will guide you through installing and setting up things.

The pieces

Before diving into the traffic control demo, we want to give you a really quick overview of tcd, OpenShift and Weave Scope.

tcd (traffic control daemon)

tcd is a simple daemon that runs on each Kubernetes node and responds to API calls. tcd manipulates the traffic control settings of the pods using the tc command which we briefly mentioned in part 1. It’s decoupled from the service being tested, meaning you can stop and restart the daemon on a pod without affecting its connectivity.

In this demo, it receives commands from buttons exposed in Weave Scope.

OpenShift

OpenShift is Red Hat’s container platform that makes it simple to build, deploy, manage and secure containerized applications at scale on any cloud infrastructure, including Red Hat’s own hosted offering, OpenShift Dedicated. Version 3 of OpenShift uses Kubernetes under the hood to maintain cluster health and easily scale services.

In the following figure, you see an example of the OpenShift dashboard with the running pods.

Here we have 1 Weave Scope app pod, 3 ping test pods, and 1 tcd pod. Using the arrow buttons one can scale the application up and down, and the circle changes color depending on the status of the application (e.g. scaling, terminating, etc.).

Weave Scope

Weave Scope helps to intuitively understand, monitor, and control containerized applications. It visually represents pods and processes running on Kubernetes and allows one to drill into pods, showing information such as CPU & memory usage, running processes, etc. One can also stop, start, and interact with containerized applications directly through its UI.

While this graphic shows Weave Scope displaying containers, we see at the top that we can also display information about processes and hosts.

How the pieces fit together

Now that we understand the individual pieces, let’s see how it all works together. Below is a diagram of our demo system.

Here we have 2 Kubernetes nodes each running one instance of the tcd daemon. tcd can only manage the traffic control settings of pods local to the Kubernetes node on which it’s running, thus the need for one per node.

On the right we see the Weave Scope app showing details for the selected pod; in this case, the one being pointed to by (4). In the red oval, we see the three buttons we’ve added to the Scope app for this demo. These set the network connectivity parameters of the selected pod’s egress traffic to a latency of 2000ms, 300ms, or 1ms, respectively, from left to right.

When clicked (1), the Scope app sends a message (2) to the Weave Scope probe running on the selected pod’s Kubernetes node. The Weave Scope probe sends a gRPC message (3), in this case a ConfigureEgressMethod message, to the tcd daemon running on its Kubernetes node, telling it to configure the pod’s egress traffic (4) accordingly.

While this demo only configures the latency, tcd can also be used to configure the bandwidth and the percentage of packet drop. As we saw in part 1, those parameters are features directly provided by the Linux netem queuing discipline.
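For the curious, these netem parameters combine into a single tc invocation; the interface name and values below are illustrative rather than what tcd literally executes:

# combine latency, packet loss and rate limiting on a pod's interface
tc qdisc replace dev eth0 root netem delay 300ms loss 3% rate 1mbit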

Being able to dynamically change the network characteristics of each pod, we can observe the behaviour of services during transitions as well as in steady state. Of course, by observe we mean test, which we’ll turn to now.

Testing with traffic control

Now for two short demos to show how traffic control can be used for testing.

Ping test

This is a contrived demo to show that the setup works and we can, in fact, manipulate the egress traffic characteristics of a pod.

The following video shows a pod downloading a small file from the Internet with the wget command, with the target host being the one for which we are adjusting the packet latency.

It should be easy to see the effects that adjusting the latency has; with greater latency it takes longer to get a reply.

Guestbook app test

We use the Kubernetes guestbook example for our next, more real-world demo. Some small modifications have been made to provide user feedback when the reply from the web server takes a long time, showing a “loading…” message. Generally, this type of thing goes untested because, as we mentioned in the introduction, our tests run under favorable networking conditions.

Tools like Selenium and agouti allow for testing web applications in an automated way without manually interacting with a browser. For this demo we’ll be using agouti with its Chrome backend so that we can see the test run.

In the following video we see this feature being automatically tested by a Go script using the Ginkgo testing framework and Gomega matcher library.

In this demo, testers still need to configure the traffic control latency manually by clicking on the Weave Scope app buttons before running the test. However, since tcd can accept commands over gRPC, the Go script could easily connect to tcd to perform that configuration automatically, and dynamically, at run time. We’ll leave that as an exercise for the reader. :)

Conclusion

With Kubernetes becoming a de facto building block of modern container platforms, we now have a basis on which to start integrating, in a standardized way, features that have long gone ignored. We think traffic control for testing, and other creative endeavors, is a good example of this.

If you’re interested in moving this forward, we encourage you to take what we’ve started and run with it. And whether you just want to talk to us about this or you need professional support in your efforts, we’d be happy to talk to you.

Thanks to…

We’d like to thank Ilya & Tom from Weaveworks and Jorge & Ryan from Red Hat for helping us with some technical issues we ran into while setting up this demo. And a special thanks to Diane from the OpenShift project for helping coordinate the effort.

Introducing systemd.conf 2016

The systemd project will be having its 2nd conference—systemd.conf—from Sept. 28th to Oct. 1st, once again at betahaus in Berlin. After the success of last year’s conference, we’re looking forward to having much of the systemd community in Berlin for a second consecutive year. As this year’s event takes place just before LinuxCon Europe, we’re expecting some new faces.

Kinvolk’s involvement

As an active user and contributor to systemd, currently through our work on rkt, we’re interested in promoting systemd and helping provide a place for the systemd community to gather.

Last year, Kinvolk helped with much of the organization. This year, we’re happy to be expanding our involvement to include handling the financial-side of the event.

In general, Kinvolk is willing to help provide support to open source projects who want to hold events in Berlin. Just send us a mail to [email protected]

Don’t fix what isn’t broken

As feedback from last year’s post-conference survey showed, most attendees were pleased with the format. Thus, very little will change this year. The biggest difference is that we’re adding another room to accommodate a few more people and to facilitate impromptu breakout sessions. Some other small changes are that we’ll have warm lunches instead of sandwiches, and that we’ve dropped the speakers’ dinner as we felt it wasn’t in line with the goal of bringing all attendees together.

Workshop day

A new addition to systemd.conf is the workshop day. The audience for systemd.conf 2015 was predominantly systemd contributors and proficient users. This was very much expected and intended.

However, we also want to give people of varying familiarity with the systemd project the chance to learn more from the people who know it best. The workshop day is intended to facilitate this. The call for presentations (CfP) will include a call for workshop sessions. These workshop sessions will be 2- to 3-hour hands-on sessions covering various areas of, or related to, the systemd project. You can consider it a day of systemd training if that helps with getting approval to attend. :)

As we expect a different audience for workshops than for the presentation and hackfest days, we will be issuing separate tickets. Tickets will become available once the call for participation opens.

Get involved!

There are several ways you can help make systemd.conf 2016 a success.

Become a sponsor

These events are only possible with the support of sponsors. In addition to helping make the event more awesome, your sponsorship allows us to bring more of the community together by sponsoring the attendance of those community members who need financial assistance to attend.

See the systemd.conf 2016 website for how to become a sponsor.

Submitting talk and workshop proposals

systemd.conf is only as good as the people who attend and the content they provide. In a few weeks we’ll be announcing the opening of the CfP. If you, or your organization, are doing interesting things with systemd, we encourage you to submit a proposal. If you want to share your knowledge of systemd with others, please consider submitting a proposal for a workshop session.

We’re excited about what this year’s event will bring and look forward to seeing you at systemd.conf 2016!

Testing Degraded Network Scenarios with rkt

The current state of testing

Testing applications is important. Some even go as far as saying, “If it isn’t tested, it doesn’t work”. While there’s a degree of both truth and untruth in that statement, the rise of continuous integration (CI) and automated testing shows that the software industry takes testing seriously.

However, there is at least one area of testing that is difficult to automate and, thus, hasn’t been adequately incorporated into testing scenarios: poor network connectivity.

The typical testing process has the developer as the first line of defence. Developers usually work within reliable networking conditions. The developers then submit their code to a CI system which also runs tests under good networking conditions. Once the CI system goes green, internal testing is usually done; ship it!

Nowhere in this process are scenarios tested in which your application experiences degraded network conditions. If your internal tests don’t cover these scenarios, it’s your users who’ll end up doing the testing. This is far from ideal and goes against the “test early, test often” mantra of CI; a bug costs you more the later it’s caught.

Three examples

To make this more concrete, let’s look at a few examples where users might notice issues that you, or your testing infrastructure, may not:

  • A web shop: You click “buy” and are redirected to a new page, but it freezes because of a connection issue. The user gets no feedback about whether the JavaScript code will retry automatically; the user doesn’t know whether she should refresh. That’s a bug. Once fixed, how do you test it? You need to break the connection just before the test script clicks the “buy” link.
  • A video streaming server: The Real-time Transport Protocol (RTP) uses UDP packets. If some packets are dropped or arrive too late, it’s not a big deal; the video player will display a degraded video because of the missing packets, but the stream will otherwise play just fine. Or will it? How can the developers of a video streaming server test a scenario where 3% of packets are dropped or delayed?
  • Consensus-based applications: Applications like etcd or ZooKeeper implement a consensus protocol. They should be designed to handle a node disconnecting from the network as well as network splits. See the approach CoreOS takes for an example.

It doesn’t take much imagination to come up with more, but these should be enough to make the point.

Where Linux can help

What functionality does the Linux kernel provide to enable us to test these scenarios?

Linux provides a means to shape both the egress traffic (emitted by a network interface) and, to some extent, the ingress traffic (received by a network interface). This is done by way of qdiscs, short for queuing disciplines. In essence, a qdisc is a packet scheduler. Using different qdiscs we can change the way packets are scheduled. qdiscs can have associated classes and filters. These all combine to let us delay, drop, or rate-limit packets, among a host of other things. A complete description is out of the scope of this blog post.

For our purposes, we’ll just look at one qdisc called “netem”, short for network emulation. This will allow us to tweak the packet scheduling characteristics we want.
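
For a taste of what this looks like in practice, here’s a minimal sketch using the tc command (the interface name eth0 and the exact values are just examples):

# add 100ms of delay to every packet leaving eth0
$ sudo tc qdisc add dev eth0 root netem delay 100ms

# switch to dropping 3% of outgoing packets instead
$ sudo tc qdisc change dev eth0 root netem loss 3%

# remove the netem qdisc to restore normal behaviour
$ sudo tc qdisc del dev eth0 root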

What about containers?

Up to this point we haven’t even mentioned containers. That’s because the story is the same with regards to traffic control whether we’re talking about bare-metal servers, VMs or containers. Containers reside in their own network namespace, providing the container with a completely isolated network. Thus, the traffic between containers, or between a container and the host, can all be shaped in the same way.
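
For example, we can run tc inside a container’s network namespace from the host with nsenter. This is just a sketch; the PID 12345 of the container’s init process and the in-namespace interface name eth0 are placeholders:

# execute tc inside the network namespace of the container's init
# process, adding 5% packet loss to its eth0 interface
$ sudo nsenter --target 12345 --net tc qdisc add dev eth0 root netem loss 5%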

Testing the idea

To test the idea, I’ve created a simple demo that starts an RTP server in a container using rkt. In order to easily tweak network parameters, I’ve hacked up a GUI written in Gtk/JavaScript. And finally, to see the results we just need to point a video player at our RTP server.

We’ll step through the demo below. But if you want to play along at home, you can find the code in the kinvolk/demo repo on GitHub.

Running the demo

First, I start the video streaming server in a rkt pod. The server streams the Elephants Dream movie to a media player via the RTP/RTSP protocols. RTSP uses a TCP connection to send commands to the server; examples of commands are choosing the file to play or seeking to a point in the middle of the stream. RTP is what actually sends the video, via UDP packets.

Second, we start the GUI to dynamically change some parameters of the network emulator. What it does is connect to the rkt pod’s network namespace and change the egress qdisc using Linux’s tc command.

Now we can adjust the values as we like. For example, when I add 5% packet loss, the quality is degraded but not interrupted. When I remove the packet loss, the video becomes clear again. When I add 10s latency in the network, the video freezes. Play the video to see this in action.
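
Under the hood, those button clicks boil down to updating the netem parameters on the pod’s egress qdisc, roughly like this sketch (the interface name and values are assumptions, and a netem qdisc is assumed to already be attached):

# degrade the stream: drop 5% of outgoing packets
$ sudo tc qdisc change dev eth0 root netem loss 5%

# freeze the stream: delay every packet by 10 seconds
$ sudo tc qdisc change dev eth0 root netem delay 10s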

What this shows us is that traffic control can be used effectively with containers to test applications - in this case a media server.

Next steps

The drawback to this approach is that it’s still manual. For automated testing we don’t want a GUI. Rather, we need a means of scripting various scenarios.

In rkt we use CNI network plugins to configure the network. Interestingly, several plugins can be used together to define several network interfaces. What I’d like to see is a plugin that allows one to configure traffic control in the network namespace of the container; a sketch of what its configuration could look like follows.
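
Such a plugin doesn’t exist yet, so the following is purely a sketch: a CNI network configuration for a hypothetical “netem” plugin type, with invented keys, dropped into rkt’s usual CNI configuration directory.

# hypothetical CNI conf; the "netem" plugin type and its keys are
# invented for illustration and do not exist today
$ cat /etc/rkt/net.d/20-degraded.conf
{
    "name": "degraded-net",
    "type": "netem",
    "delay": "100ms",
    "loss": 3
}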

In order to integrate this into testing frameworks, the traffic control parameters should be dynamically adjustable, allowing for the scriptability mentioned above.

Stay tuned…

In a coming blog post, we’ll show that this is not only interesting when using rkt as an isolated component; it’s even more interesting within a container orchestration system like Kubernetes.

Follow Kinvolk on twitter to get notified when new blog posts go live.

Welcome rkt 1.0!

About 14 months ago, CoreOS announced their intention to build a new container runtime based on the App Container Specification, introduced at the same time. Over these past 14 months, the rkt team has worked to make rkt viable for production use and get to a point where we could offer certain stability guarantees. With today’s release of rkt 1.0, the rkt team believes we have reached that point.

We’d like to congratulate CoreOS on making it to this milestone and look forward to seeing rkt mature. With rkt, CoreOS has provided the community with a container runtime with first-class integration on modern Linux systems and a security-first approach.

We’d especially like to thank CoreOS for giving us the chance to be involved with rkt. Over the past months we’ve had the pleasure to make substantial contributions to rkt. Now that the 1.0 release is out, we look forward to continuing that, with even greater input from and collaboration with the community.

At Kinvolk, we want to push Linux forward by contributing to projects that are at the core of modern Linux systems. We believe that rkt is one of these technologies. We are especially happy that we could work on making the integration with systemd as seamless as possible. There’s still work to do on this front, but we’re happy with where we’ve gotten so far.

rkt is so important because it fills a hole left by other container runtimes: it lets the operating system do what it does best, manage processes. We agree whole-heartedly with Lennart, creator and lead developer of the systemd project, when he states…

I believe in the rkt model. Integrating container and service management, so that there's a 1:1 mapping between containers and host services is an excellent idea. Resource management, introspection, life-cycle management of containers and services -- all that tightly integrated with the OS, that's how a container manager should be designed.

Lennart Poettering

Over the next few weeks, we’ll be posting a series of blog stories related to rkt. Follow Kinvolk on twitter to get notified when they go live and follow the story.

FOSDEM 2016 Wrap-up: Bowling with Containers

Another year, another trip to FOSDEM, arguably the best free & open source software event in the world, and definitely the best in Europe. FOSDEM offers an amazingly broad range of talks, surpassed only by the richness of its hallway track… and maybe the legendary beer event. ;)

This year our focus was to talk to folks about rkt, the container runtime we work on with CoreOS, and meet more people from the container development community, along with the usual catching up with old friends.

On Saturday, Alban gave a talk with CoreOS’ Jon Boulle entitled “Container mechanics in rkt and Linux”. Jon presented a general overview of the rkt project, and Alban followed with a deep dive into how containers work on Linux, and in rkt specifically. The talk was very well attended. If you weren’t able to attend, however, you can find the slides here.

For Saturday evening, we had organized a bowling event for some of the people involved in rkt, and in containers on Linux in general. Most of the people attending we’d not yet met IRL. We finally got a chance to meet the team from Intel Poland, who have been working on rkt’s LKVM stage1, the BlaBlaCar team (brave early adopters of rkt), as well as some folks from NTT and Virtuozzo. There were also a few folks we see quite often from Red Hat, Giant Swarm and, of course, the team from CoreOS. As it turns out, the best bowler was the aforementioned Jon Boulle, who bowled a very respectable score of 120.

Having taken the FOSDEM pilgrimage about 20 times collectively now, the Kinvolk team are veterans of the event. However, each year brings new, exciting topics of discussion. These are mostly shaped by one’s own interests (containers and SDN for us) but also by new trends within the community. We’re already excited to see what next year will bring. We hope to see you there!

Testing systemd Patches

It’s not so easy to test new patches for systemd. Because systemd is the first process started on boot, the traditional way to test it was to install the new version on your own computer and reboot. This approach is impractical because it makes the development cycle quite long: after writing a few lines of code, I don’t want to close all my applications and reboot. There is also the risk that my patch contains bugs; if I install a broken systemd on my development computer, it won’t boot, and it would then take even more time to fix. All of this, probably just to test a few lines of code.

This is of course not a new problem, and systemd-nspawn was first implemented in 2011 as a simple tool to test systemd in an isolated environment. Over the years, systemd-nspawn grew in features and became more than a testing tool. Today, it is integrated with other components of the systemd project, such as machinectl, and it can pull container or VM images and start them as systemd units. systemd-nspawn is also used as an internal component of the app container runtime, rkt.

When developing rkt, I often need to test patches in systemd-nspawn or in other components of the systemd project, like systemd-machined. And since systemd-nspawn uses recent features of the Linux kernel that are still being developed (cgroups, user namespaces, etc.), I also sometimes need to test a different kernel or a different systemd-machined. In those cases, testing with systemd-nspawn alone does not help, because I would still be using the kernel and the systemd-machined installed on my computer.

I still don’t want to reboot, nor do I want to install a non-stable kernel or non-stable systemd patches on my development computer. So today I’m explaining how I test new kernels and new systemd versions with kvmtool and debootstrap.

Getting kvmtool

Why kvmtool? I want to be able to install systemd into my test environment with just a “make install”. kvmtool can boot a plain host directory as the guest’s root filesystem (via virtio-9p), so I don’t have to prepare a testing image for each test; I can keep using the same filesystem tree.

$ cd ~/git
$ git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/will/kvmtool
$ cd kvmtool && make

Compiling a kernel

The kernel is compiled as usual but with the options listed in kvmtool’s README file (here’s the .config file I use). I just keep around the different versions of the kernels I want to test:

$ cd ~/git/linux
$ ls bzImage*
bzImage      bzImage-4.3         bzImage-cgroupns.v5  bzImage-v4.1-rc1-2-g1b852bc
bzImage-4.1  bzImage-4.3.0-rc4+  bzImage-v4.1-rc1     bzImage-v4.3-rc4-15-gf670268
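
For reference, here’s a minimal sketch of building and stashing one of those kernels (assuming the options from kvmtool’s README, virtio and 9p support among them, are already enabled in the .config):

$ cd ~/git/linux
$ make -j$(nproc) bzImage
# keep a copy around under a recognizable name
$ cp arch/x86/boot/bzImage bzImage-4.3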

Getting the filesystem for the test environment

The man page of systemd-nspawn explains how to install a minimal Fedora, Debian or Arch distribution in a directory with the dnf, debootstrap or pacstrap commands respectively.

sudo dnf -y --releasever=22 --nogpg --installroot=${HOME}/distro-trees/fedora-22 --disablerepo='*' --enablerepo=fedora install systemd passwd dnf fedora-release vim-minimal
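
For a Debian tree, the debootstrap equivalent might look something like this (a sketch; the suite and mirror choices are assumptions):

# install a minimal Debian stable into a directory
sudo debootstrap --arch=amd64 stable ${HOME}/distro-trees/debian-stable http://deb.debian.org/debian

The rest of this post sticks with the Fedora tree.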

Set the root password of your Fedora 22 tree the first time, and then you are ready to boot it:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 passwd

I don’t have to actually boot it with kvmtool to update the system. systemd-nspawn is enough:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 dnf update

Installing systemd

$ cd ~/git/systemd
$ ./autogen.sh
$ ./configure CFLAGS='-g -O0 -ftrapv' --enable-compat-libs --enable-kdbus --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib64
$ make
$ sudo DESTDIR=$HOME/distro-trees/fedora-22 make install
$ sudo DESTDIR=$HOME/distro-trees/fedora-22/fedora-tree make install

As you can see, I am installing systemd both in ~/distro-trees/fedora-22 and in ~/distro-trees/fedora-22/fedora-tree. The first is for the VM started by kvmtool; the second is for the container started by systemd-nspawn inside the VM.
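
Once the VM is up, starting the nested container from inside it is a one-liner (a sketch, assuming the host directory is what the VM sees as its root filesystem):

# inside the VM: boot the container from the nested tree
$ sudo systemd-nspawn -D /fedora-tree -b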

Running a test

I can quickly test my systemd patches with various kernel versions and various Linux distributions. I can also start systemd-nspawn inside lkvm if I want to test the interaction between systemd, systemd-machined and systemd-nspawn. All of this, without rebooting or installing any unstable software on my main computer.

I am sourcing the following in my shell:

# test_kvm <distro> <kernel version> [kernel parameters]
# boots the given distro tree with the given kernel under kvmtool
test_kvm() {
        distro=$1
        kernelver=$2
        kernelparams=$3

        kernelimg=${HOME}/git/linux/bzImage-${kernelver}
        distrodir=${HOME}/distro-trees/${distro}

        if [ ! -f "$kernelimg" ] || [ ! -d "$distrodir" ]; then
                echo "Usage: test_kvm distro kernelver kernelparams"
                echo "       test_kvm fedora-22 4.3 systemd.unified_cgroup_hierarchy=1"
                return 1
        fi

        # boot the distro tree as the root filesystem with the chosen kernel
        sudo ${HOME}/git/kvmtool/lkvm run --name "${distro}-${kernelver}" \
                --kernel "${kernelimg}" \
                --disk "${distrodir}" \
                --mem 2048 \
                --network virtio \
                --params="${kernelparams}"
}

Then, I can just test rkt or systemd-nspawn with the unified cgroup hierarchy:

$ test_kvm fedora-22 4.3 systemd.unified_cgroup_hierarchy=1

Conclusion

With this setup, I was able to test cgroup namespaces in systemd-nspawn with the kernel patches currently under review upstream, together with my systemd patches, all without rebooting or installing them on my development computer.