How to run untrusted containers in Kubernetes

Considering that even large companies are adopting container-based infrastructures, it is a good idea to account for the possible attack surface and the potential impact of a data breach.

Technologies like Docker (containerd) and LXC are not truly isolated systems, since they share the same Linux kernel with the host operating system they run on.

For a potential attacker, having their container launched within a large company is a golden opportunity. Does technology allow us to defend ourselves easily?

Current container technology

It has been repeated many times: containers are a new way of packaging, sharing, and deploying applications, as opposed to monolithic applications where all features are packaged into a single piece of software or even an operating system.

Containers do not exploit anything fundamentally new; they are an evolution built on top of Linux namespaces and Linux control groups (cgroups). A namespace creates a virtual, isolated user space and provides the application with its own view of system resources such as the file system, network, and processes. This abstraction allows the application to be started independently, without interfering with other applications running on the same host.

So, thanks to namespaces and cgroups together, we can start many applications on the same host, each in its own isolated environment.
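
To see what a namespace buys us in practice (a minimal illustration outside Kubernetes, assuming a Linux host with the unshare tool from util-linux), we can start a shell in its own PID and mount namespace and check which processes it sees:

# start a shell in a new PID namespace with its own /proc
sudo unshare --fork --pid --mount-proc bash

# inside the new namespace, only its own processes are visible
ps aux
# bash appears as PID 1 instead of the host's init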

Container vs Virtual Machine

It is clear that container technology solves problems of isolation, portability, and streamlined architecture compared to a virtual machine environment. But let’s not forget that a virtual machine isolates our application at the kernel level as well: the risk that an attacker can escape a container and compromise the system is much higher than the risk of escaping a virtual machine.

Most Linux kernel exploits are likely to work inside containers, and this could allow an attacker to escalate and compromise not only the affected namespace but also other namespaces on the same operating system.

These security issues led researchers to try to create namespaces that are truly separated from the host, commonly referred to as “sandboxed” containers. Right now there are several solutions that provide this, for example gVisor and Kata Containers.

Container Runtime in Kubernetes

We can go deeper into this type of technology within a container orchestrator, Kubernetes.

Kubernetes uses a component called the kubelet to manage containers. We can think of it as the captain of the ship: it takes the specifications supplied to it and carries out its operations punctually and precisely.

The kubelet takes Pod specifications and runs them as containers on the host machine to which they are assigned. It can interact with any container runtime, as long as it is compliant with the OCI standard (whose reference implementation is runC).

How container runtime works

RunC, originally embedded in the Docker architecture, was released in 2015 as a standalone tool. It has become a commonly used, standard container runtime that DevOps teams can use as part of container engines.

RunC provides all the functionality to interact with existing low-level Linux features. It uses namespaces and control groups to create and run container processes.
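
To make this concrete, here is a rough sketch of what driving runC by hand looks like (independent of Kubernetes; it assumes runc is installed and that you provide a root filesystem, for example exported from an existing image):

# prepare a bundle directory with a root filesystem
mkdir -p mycontainer/rootfs
cd mycontainer
# populate rootfs, e.g.: docker export $(docker create busybox) | tar -C rootfs -xf -

# generate a default OCI spec (config.json) for this bundle
runc spec

# create and start the container described by config.json
sudo runc run mycontainer-id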

In the following paragraphs, we will introduce the RuntimeClass and its core elements. There is also the RuntimeClass handler, whose default value is runc (for Kubernetes installations that use containerd as the container runtime).

Runtime Class

The RuntimeClass, as its name suggests, allows us to operate with various container runtimes. In 2014, Docker was the only container runtime available on Kubernetes. Starting with version 1.3 of Kubernetes, compatibility with rkt (Rocket) was added, and finally, with Kubernetes 1.5, the Container Runtime Interface (CRI) was introduced: a standard interface that any container runtime can implement directly, saving developers the trouble of adapting Kubernetes to every type of container runtime and worrying about version maintenance.

The CRI, in fact, has allowed us to decouple the container runtime from Kubernetes and, above all, has allowed technologies such as Kata Containers and gVisor to plug into a container runtime such as containerd.
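
As a quick way to see this interface in action (a sketch, assuming containerd is the runtime and is listening on its default socket path), crictl, the CLI for CRI-compatible runtimes, can query containerd through the same interface the kubelet uses:

# talk to containerd over its CRI socket (default path assumed)
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods    # list pod sandboxes
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps      # list running containers
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info    # runtime status and configuration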

In Kubernetes 1.14, RuntimeClass was reintroduced as a built-in cluster resource, with a handler property at its core.

The handler indicates the program that receives the container creation request and corresponds to a container runtime.

kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: # RuntimeClass name
handler: # container runtime, for example: runc
overhead:
  podFixed:
    memory: "" # 64Mi
    cpu: "" # 250m
scheduling:
  nodeSelector:
    <key>: <value> # container-rt: gvisor
  • The handler field points to the specific container runtime or configuration to be used.
  • Declaring overhead allows the cluster (including the scheduler) to account for it when making decisions about Pods and resources. Through these fields you specify the overhead of running Pods with this RuntimeClass and ensure it is accounted for by Kubernetes.
  • The scheduling field is used to ensure that Pods are scheduled on the right nodes.

By default, if we have a cluster with Docker or containerd, our handler is runc; if we use gVisor, it will be runsc.
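
Note that the scheduling.nodeSelector shown above only helps if the chosen label actually exists on the nodes where that runtime is installed. For example, with the illustrative container-rt label from the snippet above, you would label the gVisor-enabled node like this:

kubectl label node <node-name> container-rt=gvisor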

Isolate Linux Host and Containers with gVisor in Kubernetes

Now we will see how we can have more than one container runtime in a Kubernetes cluster and choose a stricter one for sensitive workloads.

For this tutorial, I use a previous project in which I set up a Kubernetes cluster with containerd.

GitHub - alessandrolomanto/k8s-vanilla-containerd: Can Kubernetes run without docker?

Initialize the Kubernetes Cluster

make vagrant-start

After bringing the machines up, verify that all components are up and running:

vagrant ssh master

kubectl get nodes
NAME      STATUS   ROLES                  AGE     VERSION
master    Ready    control-plane,master   7m59s   v1.21.0
worker1   Ready    <none>                 5m50s   v1.21.0
worker2   Ready    <none>                 3m51s   v1.21.0

Install gVisor on worker1

ssh worker1 # Vagrant default password: vagrant

sudo su

Install latest gVisor release

(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
)
FINISHED --2022-04-28 07:24:44--
Total wall clock time: 5.2s
Downloaded: 4 files, 62M in 3.1s (20.2 MB/s)
runsc: OK
containerd-shim-runsc-v1: OK

Configure container runtime

cat <<EOF | sudo tee /etc/containerd/config.toml
version = 2
[plugins."io.containerd.runtime.v1.linux"]
  shim_debug = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF

Restart containerd service

sudo systemctl restart containerd

Install the RuntimeClass for gVisor

cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
EOF

Verify:

vagrant@master:~$ kubectl get runtimeclass
NAME     HANDLER   AGE
gvisor   runsc     17s

Create a Pod with the gVisor RuntimeClass:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor
spec:
  runtimeClassName: gvisor
  containers:
  - name: nginx
    image: nginx
EOF

Verify that the Pod is running:

vagrant@master:~$ kubectl get pod nginx-gvisor -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP                NODE      NOMINATED NODE   READINESS GATES
nginx-gvisor   1/1     Running   0          31s   192.168.235.129   worker1   <none>           <none>
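
As an extra check that the Pod is really sandboxed, you can look at the kernel messages seen from inside the container: under gVisor they come from its user-space kernel rather than the host (this assumes the image ships the dmesg utility):

kubectl exec nginx-gvisor -- dmesg
# should print gVisor's own boot banner, e.g. "[ 0.000000] Starting gVisor..."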

For up-to-date information, follow the official docs.

Conclusions

We have seen that current container technology has weak isolation.
Common practices such as patching containers quickly and applying least-privilege security contexts can effectively restrict the attack surface. We should also start implementing runtime isolation measures like the one in the tutorial above, because it is now possible to have more than one container runtime.
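
As an example of the least-privilege security contexts mentioned above (a sketch only: the Pod name and image are placeholders, and the exact settings depend on what the application actually needs), a Pod can drop most of its privileges declaratively:

apiVersion: v1
kind: Pod
metadata:
  name: restricted-app       # placeholder name
spec:
  containers:
  - name: app
    image: my-app:1.0        # placeholder image
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]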

Sure, it is not something that everyone needs, but it will certainly come in handy when you want to run untrusted containers without impacting the host in any way.

Consider a container hosting service that starts containers from different customers on the same host machine. Could one customer compromise the others because of the shared context? Start thinking about how to mitigate these issues.