This article is based on Karpenter v1.1.1; details may vary between versions.
Karpenter handles the two most important jobs of a Kubernetes cluster autoscaler: scaling out and scaling in. In this article, we'll look at when Karpenter decides to scale out and walk through the full process of making room for pods.
When does the provisioning control loop start for Karpenter?
Karpenter is also an operator: it runs control loops that watch for events within the cluster. The provisioning loop starts again immediately after the previous iteration finishes. This process is handled inside the provisioner controller, in this file.
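To picture that, here is a minimal Go sketch of a loop that starts its next pass as soon as the previous one returns. provisionOnce is a hypothetical placeholder, not Karpenter's real function, and the real controller is driven by controller-runtime rather than a bare for loop.

```go
package main

import (
	"context"
	"log"
)

// provisionOnce is a hypothetical placeholder for one provisioning pass:
// batch pending pods, simulate scheduling, and create NodeClaims if needed.
func provisionOnce(ctx context.Context) error {
	// ... simulate scheduling and launch NodeClaims ...
	return nil
}

func main() {
	ctx := context.Background()
	// The next provisioning pass begins right after the previous one finishes.
	// (Karpenter achieves this by requeueing its controller, not by spinning.)
	for {
		if err := provisionOnce(ctx); err != nil {
			log.Printf("provisioning pass failed: %v", err)
		}
	}
}
```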
Karpenter basically "simulates" the scheduling behavior we expect from the kube-scheduler. It works in these steps:
Check all pending pods, plus the reschedulable pods on nodes marked for deletion. A node counts as marked for deletion when it is in one of these conditions:
1. The Node has MarkedForDeletion set
2. The Node has a NodeClaim counterpart that is actively deleting (or the NodeClaim is marked as terminating)
3. The Node has no NodeClaim counterpart and is actively deleting
For the pods on those nodes to count as reschedulable, they must meet both conditions below (sketched in code right after the list):
1. Be an active pod (not terminal and not actively terminating), or be owned by a StatefulSet and terminating
2. Not be owned by a DaemonSet
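Here is a minimal Go sketch of that filter using the upstream corev1.Pod type. It approximates the behavior described above and is not Karpenter's actual code; ownedBy and isReschedulable are names made up for this example.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// ownedBy reports whether the pod has an owner reference of the given kind.
func ownedBy(pod *corev1.Pod, kind string) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == kind {
			return true
		}
	}
	return false
}

// isReschedulable approximates the filter for pods on nodes marked for deletion.
func isReschedulable(pod *corev1.Pod) bool {
	terminal := pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
	terminating := pod.DeletionTimestamp != nil

	// 1. Active pod (not terminal, not terminating), or a terminating pod
	//    owned by a StatefulSet (its identity will be recreated elsewhere).
	active := !terminal && !terminating
	statefulSetTerminating := ownedBy(pod, "StatefulSet") && terminating
	if !active && !statefulSetTerminating {
		return false
	}

	// 2. DaemonSet pods are recreated by the DaemonSet controller on new
	//    nodes, so the simulation never carries them over.
	return !ownedBy(pod, "DaemonSet")
}
```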
Karpenter then sorts that list in descending order by CPU and memory requests and turns the sorted list into a queue.
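A rough sketch of that ordering step, assuming requests are summed per pod across its containers; podRequests and orderPods are illustrative names, not Karpenter's real functions.

```go
package sketch

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequests sums a single resource across the pod's containers.
func podRequests(pod *corev1.Pod, name corev1.ResourceName) resource.Quantity {
	total := resource.Quantity{}
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Requests[name]; ok {
			total.Add(q)
		}
	}
	return total
}

// orderPods returns the pods sorted by CPU, then memory requests, in
// descending order, approximating the queue Karpenter builds.
func orderPods(pods []*corev1.Pod) []*corev1.Pod {
	sorted := append([]*corev1.Pod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		ci, cj := podRequests(sorted[i], corev1.ResourceCPU), podRequests(sorted[j], corev1.ResourceCPU)
		if cmp := ci.Cmp(cj); cmp != 0 {
			return cmp > 0 // larger CPU request first
		}
		mi, mj := podRequests(sorted[i], corev1.ResourceMemory), podRequests(sorted[j], corev1.ResourceMemory)
		return mi.Cmp(mj) > 0 // then larger memory request first
	})
	return sorted
}
```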
(Simulation) Scheduling the pods one by one
Karpenter now pops one pod at a time, simulates its scheduling result, and determines whether a new node is necessary.
This is the part where Karpenter tries to mimic the kube-scheduler's scheduling logic, which leads to gaps such as Karpenter not supporting MatchLabelKeys in TopologySpreadConstraints. To keep up with the new features that arrive with every Kubernetes release, Karpenter's scheduler simulation needs continuous updates.
Karpenter prefers to schedule pods on nodes in the following order (sketched below):
- Schedule onto existing in-flight, real nodes.
- Schedule onto new NodeClaims that are already planned to be created.
- Create a new NodeClaim.
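A highly simplified sketch of that preference order. inFlightNode, plannedNodeClaim, and the podFits* helpers are stand-ins invented for this example; the real compatibility checks are the ones listed next.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// inFlightNode and plannedNodeClaim are stand-ins for Karpenter's internal
// scheduling structures; they are not the real types.
type inFlightNode struct{ name string }
type plannedNodeClaim struct{ pods []*corev1.Pod }

// podFitsNode / podFitsNodeClaim stand in for the compatibility checks
// described in the list that follows (taints, ports, resources, topology).
func podFitsNode(pod *corev1.Pod, n *inFlightNode) bool           { return false }
func podFitsNodeClaim(pod *corev1.Pod, nc *plannedNodeClaim) bool { return false }

// schedulePod tries the three options in order of preference: an existing
// in-flight node, a NodeClaim already planned during this simulation, and
// only then a brand-new NodeClaim.
func schedulePod(pod *corev1.Pod, nodes []*inFlightNode, planned []*plannedNodeClaim) []*plannedNodeClaim {
	for _, n := range nodes {
		if podFitsNode(pod, n) {
			return planned // fits on real capacity; nothing new to create
		}
	}
	for _, nc := range planned {
		if podFitsNodeClaim(pod, nc) {
			nc.pods = append(nc.pods, pod)
			return planned // rides along on a NodeClaim we already plan to launch
		}
	}
	// Nothing fits: plan one more NodeClaim sized around this pod's requests.
	return append(planned, &plannedNodeClaim{pods: []*corev1.Pod{pod}})
}
```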
And of course, it checks whether the pod's requirements and the node's requirements match (a few of these checks are sketched after the list):
- The pod must tolerate the node’s taints.
- If the node is an existing in-flight node, check that the pod doesn't exceed the node's volume limits.
- Check for any conflicts in host ports.
- Check the total resource requests including the new pod. If it's a new NodeClaim, add the pod's requests to the running total; if it's an in-flight node, check that the summed requests fit within the node's capacity.
- Check the nodeAffinity and nodeSelector requirements to see whether the pod is compatible with the node.
- Finally, check the topology requirements, combining the keys and values from the topologySpreadConstraints and node affinities. Preferred affinities are not part of the calculation here.
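A few of these checks, sketched with the upstream Kubernetes API types: pod tolerations against node taints, host port conflicts, and whether the summed resource requests still fit the capacity. This is illustrative, not Karpenter's actual implementation.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// tolerates checks that the pod tolerates every taint on the candidate node.
func tolerates(pod *corev1.Pod, taints []corev1.Taint) bool {
	for i := range taints {
		tolerated := false
		for j := range pod.Spec.Tolerations {
			if pod.Spec.Tolerations[j].ToleratesTaint(&taints[i]) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

// hostPortsConflict reports whether the pod asks for a host port that is
// already claimed on the candidate node.
func hostPortsConflict(pod *corev1.Pod, usedHostPorts map[int32]bool) bool {
	for _, c := range pod.Spec.Containers {
		for _, p := range c.Ports {
			if p.HostPort != 0 && usedHostPorts[p.HostPort] {
				return true
			}
		}
	}
	return false
}

// resourcesFit checks that already-allocated requests plus the pod's requests
// stay within the node's (or candidate NodeClaim's) capacity.
func resourcesFit(allocated, podRequests, capacity corev1.ResourceList) bool {
	for name, req := range podRequests {
		total := allocated[name].DeepCopy()
		total.Add(req)
		capQty, ok := capacity[name]
		if !ok || total.Cmp(capQty) > 0 {
			return false
		}
	}
	return true
}
```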
This process is repeated until all pods in the queue are simulated.
Creating a node from a NodeClaim
Now, if the scheduling process produces one or more NodeClaims, they are finally synced to the cluster, registering the custom resources with the API server.
Because a NodeClaim is the representation of a Kubernetes Node request that reaches the cloud provider through Karpenter, this custom resource goes through three steps in its lifecycle. According to the documentation webpage:
In addition to tracking the lifecycle of Nodes, NodeClaims serve as requests for capacity. Karpenter creates NodeClaims in response to provisioning and disruption needs (pre-spin). Whenever Karpenter creates a NodeClaim, it asks the cloud provider to create the instance (launch), register and link the created node with the NodeClaim (registration), and wait for the node and its resources to be ready (initialization).

So the scheduling part of provisioning was the pre-spin step: it determines the size of the request used to generate a specific NodeClaim.
This now triggers a separate control loop in the NodeClaim custom resource controller, in this code. That loop is connected to the cloud provider's implementation of the CloudProvider.Create() method, which launches an instance for the NodeClaim with the given resource requests and requirements and returns a hydrated NodeClaim.
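Paraphrased, the relevant part of that contract looks roughly like this. The NodeClaim type and the Create signature here are simplified stand-ins for this sketch, so check Karpenter's cloudprovider package for the real definitions.

```go
package sketch

import "context"

// NodeClaim is a stand-in for Karpenter's NodeClaim type in this sketch.
type NodeClaim struct {
	Name       string
	ProviderID string // filled in by the cloud provider after launch
}

// CloudProvider paraphrases the part of the contract discussed here: Create
// takes the desired NodeClaim and returns a "hydrated" copy describing the
// instance that was actually launched. Signatures are simplified.
type CloudProvider interface {
	Create(ctx context.Context, nodeClaim *NodeClaim) (*NodeClaim, error)
}

// launch shows how a lifecycle controller would use Create: hand the claim
// to the cloud provider and record the resulting provider ID.
func launch(ctx context.Context, cp CloudProvider, nc *NodeClaim) error {
	hydrated, err := cp.Create(ctx, nc)
	if err != nil {
		return err
	}
	nc.ProviderID = hydrated.ProviderID
	return nil
}
```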
For the registration step, Karpenter finds the node matching the NodeClaim's providerID and applies the labels, annotations, and taints that were requested in the NodeClaim spec.
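A sketch of that matching-and-labeling step against corev1.Node; register is a made-up helper that shows the idea, not the controller's real code.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// register mirrors the registration step: find the Node whose providerID
// matches the NodeClaim's, then apply the labels, annotations, and taints
// requested in the claim. Types and arguments are simplified.
func register(nodes []*corev1.Node, providerID string, labels, annotations map[string]string, taints []corev1.Taint) *corev1.Node {
	for _, node := range nodes {
		if node.Spec.ProviderID != providerID {
			continue
		}
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		for k, v := range labels {
			node.Labels[k] = v
		}
		if node.Annotations == nil {
			node.Annotations = map[string]string{}
		}
		for k, v := range annotations {
			node.Annotations[k] = v
		}
		node.Spec.Taints = append(node.Spec.Taints, taints...)
		return node
	}
	return nil // the node has not registered with the API server yet
}
```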
Finally, the last step initializes the NodeClaim: Karpenter syncs whether the node is ready or not into the NodeClaim and updates the lifecycle status inside the status section of the NodeClaim object.
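And a sketch of what "initialized" roughly means here, assuming readiness plus the requested resources being reported in the node's capacity; Karpenter's actual checks (for example, startup taints being removed) are more involved.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeReady reports whether the Node's Ready condition is True.
func nodeReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// initialized approximates the last lifecycle step: the NodeClaim is only
// considered initialized once the node is Ready and has advertised the
// resources the claim asked for. Simplified from Karpenter's logic.
func initialized(node *corev1.Node, requested corev1.ResourceList) bool {
	if !nodeReady(node) {
		return false
	}
	for name := range requested {
		if _, ok := node.Status.Capacity[name]; !ok {
			return false // e.g. a device plugin hasn't registered its resource yet
		}
	}
	return true
}
```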
With all these steps, Karpenter reads the pods that fail to schedule, or that are otherwise not in a stable state, simulates the scheduling process, reads the demand for new nodes from that scheduling result, and creates new nodes to satisfy it through the seamless connection of the Kubernetes API and the cloud provider API.