Kubernetes Beyond VMs: A Deep Dive into Bare-Metal Efficiency with SIGHUP Distribution

SIGHUP Distribution provides everything you need to handle bare-metal complexity, enabling you to run Kubernetes on bare-metal nodes with high pod density.

In the world of cloud-native infrastructure, the VM-based model has long been the standard. Driven by the undeniable need for raw performance and cost efficiency, an alternative approach is gaining traction: running Kubernetes directly on bare metal.
Between rising hypervisor costs and the GPU hunger driven by AI, the premise has been in place for a while.

Our customers have increasingly been asking how to maximize the value of their powerful bare-metal servers. They don't want to be bound by the default 110-pod-per-node limit, which is designed for smaller, VM-based instances. They want to know: can Kubernetes really scale to make modern hardware worth every penny spent on it?

We decided to answer that question definitively, and we are proud to announce this capability through our SIGHUP Kubernetes Distribution.

In the following sections, we will walk through some of the modifications that needed to be applied and whether the results met our expectations.

The Goal: Beyond 110 Pods

Our goal was to test the SIGHUP Distribution (v1.33.1) on server-grade machines and push well beyond the defaults. We set an initial target of running at least 350 pods per node.

Our testing environment consisted of four identical bare-metal machines, each with 40 CPUs and 512GB of RAM, running Ubuntu 24.04. We configured a three-node HA control plane and a single worker node.

Key Challenges and How SIGHUP Distribution Solved Them

Simply installing Kubernetes and setting maxPods to 500 doesn't work: you immediately hit bottlenecks at the OS, network, and etcd layers. Here's what we found and how SIGHUP Distribution (SD) provides the necessary controls.

1. Don't Waste Your Control Plane

In a high-density, resource-rich environment, dedicating three powerful servers only to the control plane is a significant waste.
For the purposes of our tests, we made the control-plane nodes schedulable by removing their taints.

YAML

spec:
  kubernetes:
    masters:
      ...
      taints: []

2. Tuning the Kernel, the Cloud-Native Way

As we approached 350 pods on a single node, we hit our first major wall: "too many open files" errors. The issue? The kubelet heavily uses inotify to watch file changes, and Ubuntu's default limits are far too conservative for this level of density.

Although manually SSH-ing into nodes to change sysctl parameters is viable, it is not a scalable solution. As part of this effort, we've implemented a declarative way to do it. Users can now specify arbitrary Linux kernel parameters, which will be applied on nodes via sysctl, allowing for the granular control needed without losing the Git-friendly approach.

YAML

spec:
  kubernetes:
    advanced:
      kernelParameters:
      - name: "fs.inotify.max_user_instances"
        value: "8192"
      - name: "fs.inotify.max_user_watches"
        value: "524288"

3. Adapting the CNIs for High-Density Nodes

To run 500 pods on a node, your networking stack needs to be configured for it from day one. We tested both the Cilium and Calico CNIs, and both could handle the networking with the proper configuration.
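
Two sizing knobs have to agree before the first node joins: the kubelet's per-node pod limit and the per-node pod CIDR handed out to the CNI. The sketch below uses the upstream kubelet API rather than the SD configuration file, and the numbers are only indicative; a /23 per node yields 512 addresses, which leaves headroom for a 500-pod target.

YAML

# Generic sketch (upstream kubelet API, not the SD schema): maxPods is only
# meaningful if the node's pod CIDR can back it. The per-node CIDR size is set
# on kube-controller-manager (--node-cidr-mask-size) or in the CNI's own IPAM
# settings, depending on how the cluster is deployed.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 500   # fits inside a /23 per-node range (512 IPs)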

The Real Bottlenecks

Solving the configuration and implementing it in a sane way was only half the battle. The tests revealed critical insights into how the underlying components behave under extreme load.

1. Your etcd Is the Real "Deal Breaker"

During our initial runs, we observed etcd leader re-elections and high API server latencies. The cause was high I/O latency on the system disks: etcd is incredibly sensitive to write latency.
Our recommendation: this is non-negotiable. Follow etcd's hardware recommendations and run fio tests on your intended disks; the 99th percentile for sync operations must stay below 10ms. If it doesn't, use faster disks or a dedicated SAN.
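
The fio test tells you whether a disk is suitable before anything is installed; once the cluster is running, the same 10ms budget can also be watched continuously. Below is a minimal sketch of a PrometheusRule that alerts on etcd's WAL fsync p99, assuming the Prometheus Operator CRDs that ship with the monitoring stack; the alert name, namespace, and evaluation window are our own choices.

YAML

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-disk-latency     # illustrative name
  namespace: monitoring       # adjust to where your Prometheus Operator looks for rules
spec:
  groups:
    - name: etcd-disk
      rules:
        - alert: EtcdWALFsyncP99High
          # 99th percentile of etcd WAL fsync latency over the last 5 minutes
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: etcd WAL fsync p99 latency is above the 10ms budget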

2. The kube-proxy vs. Cilium eBPF Debate

We tested clusters with and without the traditional kube-proxy, using Cilium's eBPF replacement instead. While we did observe that Cilium's replacement can lead to lower pod latencies, the results were not always dramatically different. This suggests that at this scale, other factors (like etcd latency or kernel tuning) can have an equal or greater impact.
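
For context, this is roughly what enabling the eBPF-based replacement looks like with Cilium's upstream Helm values; how these values are surfaced through the SD networking module may differ, and the API server address below is a placeholder.

YAML

# Upstream Cilium Helm values for a kube-proxy-less data plane (a sketch).
kubeProxyReplacement: true   # boolean on recent Cilium releases; older charts used "strict"
k8sServiceHost: 10.0.0.10    # placeholder: the API server address/VIP Cilium should reach
k8sServicePort: 6443         # placeholder: the API server port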

3. Monitoring Your (Large) Footprint

More pods mean more metrics. We found that Prometheus's RAM usage scaled directly with the pod count, reaching ~5GB during our most intensive runs.

A key discovery concerned the Prometheus Adapter. With the default installEnhancedHPAMetrics set to true, its resource usage was very high. By setting it to false (if you don't need enhanced HPA metrics), we reduced its RAM usage from 1.68GB to just 46.5MB. This is another critical tuning parameter exposed in the SD configuration.
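
In the SD configuration this is a single toggle. The exact nesting shown below is indicative only; check the schema reference for your SD release before applying it.

YAML

spec:
  distribution:
    modules:
      monitoring:
        prometheusAdapter:
          # Disable only if you don't rely on the enhanced HPA metrics.
          installEnhancedHPAMetrics: false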

The Results: Success at Scale

So, did we meet our goal?

Yes, and we surpassed it. We found that SIGHUP Distribution is more than capable of running ~500 pods per node.

Across our 4-node cluster, we ran tests creating 1800 pods. Even at 93% of total pod capacity, the cluster remained stable. The maximum time from a pod being scheduled to being "Ready" was a mere 34 seconds, demonstrating the robustness of the control plane and CNI under pressure.

Trust the Peter Tingle

As Uncle Ben said, with great power comes great responsibility.

Virtualization provides incredible features and flexibility that are lost on bare metal.
Moving to bare metal significantly reduces the elasticity and agility that you get with virtualization. The most immediate impact is on scaling. In a VM-based environment, adding a new Kubernetes node is a fast API call; a new VM can be provisioned from a template and join the cluster in minutes. On bare metal, this same action becomes a slow, physical, and manual process. It involves procuring hardware, racking it, cabling it, and provisioning the OS, turning a task that took minutes into one that can take days, weeks, or even months.
As for the features: you can no longer take an instantaneous snapshot of a node before a risky operation and simply revert if it fails. The critical ability to live-migrate a workload from one physical server to another for hardware maintenance without downtime is also gone. You also lose built-in features like hypervisor-level high availability, which automatically restarts a failed VM on another host, forcing you to rely solely on Kubernetes's own pod-rescheduling mechanisms when a physical machine dies.

A Hybrid Approach

While we used four physical nodes for testing purposes, a hybrid approach with a VM-based control plane is probably more appropriate.
A VM-based control plane makes node sizing easier and less wasteful, leaving you extra room for errors and critical situations, while the worker nodes run on pure bare metal and do the heavy lifting without the resource overhead of a virtualization layer.

Bare Metal is Ready for Prime Time

Running Kubernetes on bare metal is complex, but the barriers are no longer technical; they are operational. The game-changer is moving from manual, imperative management to a declarative, API-driven model.

Our findings show that SIGHUP Distribution provides the critical configuration hooks, sane defaults, and automation necessary to tame bare-metal complexity. You get the performance of raw hardware with the declarative, cloud-native experience you expect.

Read our new "SD On High-Density Bare-Metal Nodes" report for all the technical details, result charts, and complete configurations we've used.

When you're ready to stop paying the hypervisor tax and unlock the full potential of your hardware, contact our team for a deep-dive on the SIGHUP Distribution.