The story of deploying a pod to every node and preventing its termination

During the development of one of our new features, I faced an interesting challenge. The feature request is simple and clear: there is an existing DaemonSet (a workload running on "every" node) on the target Kubernetes cluster, and we have to deploy another workload next to each of its instances and prevent that workload from termination under certain conditions. Let's split the problem into two parts: deployment and prevention.

From the deployment perspective, another DaemonSet makes a lot of sense. If we use the same node selectors as the existing one, Kubernetes will deploy our pods to the same nodes. In our case a custom operator is working in the background, so we are able to keep the node selectors in sync, but for other kinds of deployments this could be a tricky piece.
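
A minimal sketch of such a companion DaemonSet (every name and label here is hypothetical, standing in for the real workload):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: companion
spec:
  selector:
    matchLabels:
      app: companion
  template:
    metadata:
      labels:
        app: companion
    spec:
      nodeSelector:                # copied from the target DaemonSet, kept in sync by the operator
        node-role/storage: "true"  # hypothetical label selecting the same nodes
      containers:
      - name: companion
        image: nginx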

On the topic of prevention, PodDisruptionBudget [PDB] comes into the picture. Without going into the details, a PDB allows us to define how many of the target pods Kubernetes may disrupt at once. It has a maxUnavailable field, and if we set it to 1, Kubernetes is only able to drain/upgrade/downscale nodes one by one. That sounds like a solution, but we aren't lucky enough this time, because maxUnavailable doesn't support DaemonSet, only a few built-in higher-level controllers. DaemonSet counts as an arbitrary controller, and for those only minAvailable (as an integer) is supported, with some other limitations. While maxUnavailable could be a static number, minAvailable needs to follow the number of replicas to ensure only 1 pod may be terminating at a time. As I have mentioned, we already have an operator which can update the value based on the number of DaemonSet pods.
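
A minimal sketch of such a PDB, assuming the pods are labeled app=companion and the DaemonSet currently runs 3 pods (use policy/v1beta1 on clusters older than 1.21):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: companion-pdb
spec:
  minAvailable: 2        # replicas - 1; the operator updates this whenever the pod count changes
  selector:
    matchLabels:
      app: companion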

In summary:

  • Deploy a DaemonSet with the same node selectors as the target DaemonSet
  • Actively monitor and update node selectors to ensure pod locality
  • Deploy a PDB with the right number of minAvailable
  • Update minAvailable each time the number of pods changes

Let's give it a try. The solution seems to work at first: the PDB controller calculates the right amount of allowed disruptions all the time. But when we test it on different managed Kubernetes services, the overall result doesn't shine as we expected. After a bit of research, I found that the node drain request has an --ignore-daemonsets option. It seems some providers use this option and simply terminate DaemonSet pods without taking care of the PDB. WAT! Time to find another solution.
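
For example, draining a node while skipping DaemonSet pods looks like this (node name taken from the kind cluster used below):

kubectl drain kind-worker2 --ignore-daemonsets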

Because DaemonSet is not an option, we have to go with Deployment (ReplicaSet, StatefulSet, etc.). The simple part is that we are able to use maxUnavailable of the PDB, but how do we ensure each node runs only one instance? The first candidate is Pod Topology Spread Constraints [PTSC]. In a nutshell, with PTSC we are able to define pod distribution on the cluster based on topology keys (node labels). Sounds promising. We change the operator to update the replicas of the Deployment instead of the minAvailable of the PDB.
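
The relevant part of the manifest is sketched below (assuming the pods are labeled app=nginx; see the gist for the full Deployment):

      topologySpreadConstraints:
      - maxSkew: 1                           # allow at most 1 pod difference between nodes
        topologyKey: kubernetes.io/hostname  # treat every node as its own topology domain
        whenUnsatisfiable: DoNotSchedule     # refuse to schedule instead of increasing the skew
        labelSelector:
          matchLabels:
            app: nginx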

kubectl apply -f https://gist.githubusercontent.com/mhmxs/d18ef95bc61d4c653b85aa3235d7d3e6/raw/9c0bad9ad9549be98d7136056ddcd14dd25b0479/deploymentwithPTSC.yaml
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE    IP          NODE                 
nginx-deployment-65c9b96899-4nj8s   1/1     Running   0          102s   10.38.0.2   kind-worker2         
nginx-deployment-65c9b96899-d9kgv   1/1     Running   0          102s   10.40.0.3   kind-worker          
nginx-deployment-65c9b96899-lr45h   1/1     Running   0          102s   10.32.0.3   kind-control-plane   

At first glance, it did the trick, but what happens if we increase the number of replicas?

kubectl scale --replicas=4 deploy/nginx-deployment
deployment.apps/nginx-deployment scaled
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE                 
nginx-deployment-65c9b96899-4vbjj   1/1     Running   0          49s   10.38.0.1   kind-worker2         
nginx-deployment-65c9b96899-7zrbz   1/1     Running   0          87s   10.38.0.2   kind-worker2         
nginx-deployment-65c9b96899-bwdwr   1/1     Running   0          89s   10.32.0.3   kind-control-plane   
nginx-deployment-65c9b96899-wvd45   1/1     Running   0          91s   10.40.0.3   kind-worker          

It seems DoNotSchedule wasn't strong enough to block pod placement: with four replicas on three nodes, a 2-1-1 distribution still satisfies maxSkew: 1, so the scheduler happily places the extra pod. What else can we do?

In Kubernetes there is another concept called Pod Affinity/Anti-affinity. For details please follow the documentation, it is a huge topic, but for now I would like to use pod anti-affinity on the pod against itself during scheduling. In short, the Deployment's pods refuse to be scheduled next to another instance of themselves. At this point we can optionally remove the PTSC from the Deployment.
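
The rule is sketched below (again assuming the pods are labeled app=nginx):

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:  # a hard rule, enforced at scheduling time only
          - labelSelector:
              matchLabels:
                app: nginx                                 # matches the pod's own label
            topologyKey: kubernetes.io/hostname            # so at most one matching pod fits per node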

kubectl apply -f https://gist.githubusercontent.com/mhmxs/85ba6fa37df21f26fb30991b3d769005/raw/b9f26c6764bce2cd10bc33d4fd87474d66de203c/deploymentWithPAA.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE                 
nginx-deployment-56cc8569bd-wh4wv   0/1     Pending   0          24s                  
nginx-deployment-65c9b96899-7zrbz   1/1     Running   0          14m   10.38.0.2   kind-worker2         
nginx-deployment-65c9b96899-bwdwr   1/1     Running   0          14m   10.32.0.3   kind-control-plane   
nginx-deployment-65c9b96899-wvd45   1/1     Running   0          14m   10.40.0.3   kind-worker          

Oops, there is an error! Let's describe the pending pod.

kubectl describe po nginx-deployment-56cc8569bd-wh4wv
Name:           nginx-deployment-56cc8569bd-wh4wv
Namespace:      default
Priority:       0
Node:           
Labels:         app=nginx
                pod-template-hash=56cc8569bd
Annotations:    
Status:         Pending
Conditions:
  Type           Status
  PodScheduled   False 
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  57s (x4 over 2m28s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match pod anti-affinity rules.

The reason is simple. The new pod's anti-affinity rule also matches the old pods, so it cannot be scheduled on any node that already runs one; that's why it is pending. To solve this problem we would have to downscale the Deployment to 0, and upscale it to the right number at the end. I think we all agree this is just a painful workaround and not a solution.
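
The workaround would look like this:

kubectl scale --replicas=0 deploy/nginx-deployment
kubectl scale --replicas=3 deploy/nginx-deployment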

After a long time of researching, reading documentation, and experimenting, I felt very desperate, you can believe it. At this point I read the manifest as deployed by Kubernetes and found that Kubernetes had changed my original and configured a rolling upgrade strategy:
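
These are the defaults Kubernetes fills in when no strategy is specified:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%        # up to 25% extra pods may be created during an upgrade
    maxUnavailable: 25%  # up to 25% of pods may be unavailable during an upgrade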

Because my main goal is to ensure that pods terminate one by one, it made sense to replace the percentages in the strategy with exact numbers.
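
Presumably something like this (a sketch; the gist below has the exact values):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0        # never start the replacement before the old pod is gone
    maxUnavailable: 1  # replace at most one pod at a time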

kubectl apply -f https://gist.githubusercontent.com/mhmxs/e07dcf97a68041de7e52f188356fdafa/raw/28a9e30a8ca333603ba995d32d45cdc93face432/deploymentwithPAAAndStrategy.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-56cc8569bd-cqp5t   1/1     Running   0          34s   10.32.0.3   kind-control-plane   
nginx-deployment-56cc8569bd-s8jx2   1/1     Running   0          23s   10.40.0.3   kind-worker          
nginx-deployment-56cc8569bd-wh4wv   1/1     Running   0          23m   10.38.0.1   kind-worker2         

The result was surprising at first. The rolling upgrade strategy allowed Kubernetes to terminate one running pod before starting the new one, so pod anti-affinity can't block the rollout anymore.

It seems I did it! Just a few tweaks and it can go to production :D. This workload is a critical one, so I gave it a priorityClassName and a bunch of tolerations. Pod priority defines how important a pod is (under resource pressure, Kubernetes evicts less important pods first), and tolerations allow scheduling pods onto unhealthy (tainted) nodes. (I also suggest implementing graceful termination, based on your needs.)
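
The additions look roughly like this (the PriorityClass name is hypothetical and has to be created separately):

      priorityClassName: critical-storage-workload
      tolerations:
      - key: node.kubernetes.io/not-ready      # stay scheduled on not-ready nodes
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable    # stay scheduled on unreachable nodes
        operator: Exists
        effect: NoExecute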

kubectl apply -f https://gist.githubusercontent.com/mhmxs/8a2a95bd9e0c86051c6ea125a9dc6c00/raw/3ef7e4da65025da204dba944461634d977e53ec2/deploymentwithPAAAndStrategyAndToleration.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP          NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-865777c646-2vpqr   1/1     Running   0          3m15s   10.38.0.1   kind-worker2         
nginx-deployment-865777c646-5jc4m   1/1     Running   0          3m23s   10.32.0.3   kind-control-plane   
nginx-deployment-865777c646-tm6cq   1/1     Running   0          3m23s   10.40.0.3   kind-worker          

If you think this solution is not "Cloud Native" at all, you might be right. But keep in mind, I'm a storage developer and customer data comes first for me. With this solution, we are able to monitor volumes on the cluster per node, and control when and how Kubernetes is able to drain data.
