During the development of one of our new features, I faced an interesting challenge. The feature request is simple and clear: there is an existing DaemonSet (a workload running on "every" node) on the target Kubernetes cluster, we have to deploy another workload next to each of its instances, and we have to prevent workload termination under certain conditions. Let's split the problem into two parts: deployment and prevention.
From the deployment perspective, another DaemonSet makes a lot of sense. If we use the same node selectors as the existing one, Kubernetes will deploy its pods to the same nodes. In our case a custom operator is working in the background, so we are able to keep the node selectors in sync, but for other kinds of deployments this can be a tricky piece.
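As a minimal sketch, the companion DaemonSet simply mirrors the target's nodeSelector (the storage: enabled label is a hypothetical example; in practice the operator copies whatever the target DaemonSet uses):

  # companion DaemonSet pod template (fragment)
  spec:
    nodeSelector:
      storage: enabled   # hypothetical label, copied from the target DaemonSet's spec.template.spec.nodeSelector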
On the topic of prevention, PodDisruptionBudget [PDB] comes into the picture. Without going into details, a PDB lets us define how many of the target pods Kubernetes is allowed to disrupt at once. It has a maxUnavailable field, and if we set it to 1, Kubernetes is only able to drain/upgrade/downscale nodes one by one. That sounds like a solution, but we aren't lucky enough this time, because maxUnavailable doesn't support DaemonSets, only a few other higher-level abstractions. A DaemonSet counts as an arbitrary controller, so only minAvailable is supported, with some further limitations. While maxUnavailable could stay a static number, minAvailable needs to follow the number of replicas to ensure only 1 pod is terminating at a time. As I have mentioned, we already have an operator which can update the value based on the number of DaemonSet pods.
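For example, with 3 DaemonSet pods the operator would keep minAvailable at 2, so at most one pod can be evicted at a time. A minimal sketch (the name and label are hypothetical):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: companion-pdb          # hypothetical name
spec:
  minAvailable: 2              # kept at (number of DaemonSet pods - 1) by the operator
  selector:
    matchLabels:
      app: companion           # hypothetical label of the companion pods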
In summary:
- Deploy a DaemonSet with the same node selectors as the target DaemonSet
- Actively monitor and update node selectors to ensure pod locality
- Deploy a PDB with the right number of minAvailable
- Update minAvailable each time the number of pods changes
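To experiment with the scheduling part, let's reproduce it with a plain nginx Deployment on a 3-node kind cluster. My first attempt relied on a Pod Topology Spread Constraint (PTSC) to keep at most one pod per node; the relevant fragment of the manifest applied below looks roughly like this (a sketch, assuming the spread is defined over the node hostname with maxSkew 1):

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx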
kubectl apply -f https://gist.githubusercontent.com/mhmxs/d18ef95bc61d4c653b85aa3235d7d3e6/raw/9c0bad9ad9549be98d7136056ddcd14dd25b0479/deploymentwithPTSC.yaml
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE    IP          NODE
nginx-deployment-65c9b96899-4nj8s   1/1     Running   0          102s   10.38.0.2   kind-worker2
nginx-deployment-65c9b96899-d9kgv   1/1     Running   0          102s   10.40.0.3   kind-worker
nginx-deployment-65c9b96899-lr45h   1/1     Running   0          102s   10.32.0.3   kind-control-plane
At first sight it did the trick, but what happens if we increase the number of replicas?
kubectl scale --replicas=4 deploy/nginx-deployment
deployment.apps/nginx-deployment scaled
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE
nginx-deployment-65c9b96899-4vbjj   1/1     Running   0          49s   10.38.0.1   kind-worker2
nginx-deployment-65c9b96899-7zrbz   1/1     Running   0          87s   10.38.0.2   kind-worker2
nginx-deployment-65c9b96899-bwdwr   1/1     Running   0          89s   10.32.0.3   kind-control-plane
nginx-deployment-65c9b96899-wvd45   1/1     Running   0          91s   10.40.0.3   kind-worker
It seems DoNotSchedule wasn't strong enough to block the placement: a topology spread constraint only limits the skew between nodes, it doesn't cap the number of pods per node, so the fourth pod still gets scheduled. What else should we do?
In Kubernetes there is another concept called Pod Affinity/Anti-affinity. For details please follow the documentation, it is a huge topic, but for now I would like to use pod anti-affinity of the pod against itself during scheduling. In short, a pod of the Deployment refuses to be scheduled onto a node that already runs another pod with the same labels. At this point we can optionally remove the PTSC from the Deployment.
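The relevant fragment is roughly the following (a sketch; the app: nginx label comes from the Deployment, while the hostname topology key is an assumption matching the one-pod-per-node goal):

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname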
kubectl apply -f https://gist.githubusercontent.com/mhmxs/85ba6fa37df21f26fb30991b3d769005/raw/b9f26c6764bce2cd10bc33d4fd87474d66de203c/deploymentWithPAA.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE
nginx-deployment-56cc8569bd-wh4wv   0/1     Pending   0          24s   <none>      <none>
nginx-deployment-65c9b96899-7zrbz   1/1     Running   0          14m   10.38.0.2   kind-worker2
nginx-deployment-65c9b96899-bwdwr   1/1     Running   0          14m   10.32.0.3   kind-control-plane
nginx-deployment-65c9b96899-wvd45   1/1     Running   0          14m   10.40.0.3   kind-worker
Oops, there is an error! Let's describe the pending pod.
kubectl describe po nginx-deployment-56cc8569bd-wh4wv
Name:         nginx-deployment-56cc8569bd-wh4wv
Namespace:    default
Priority:     0
Node:         <none>
Labels:       app=nginx
              pod-template-hash=56cc8569bd
Annotations:  <none>
Status:       Pending
Conditions:
  Type           Status
  PodScheduled   False
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  57s (x4 over 2m28s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match pod anti-affinity rules.
The reason is simple: the new version of the pod doesn't tolerate the old one, that's why it is pending. To solve this problem we would have to scale the Deployment down to 0, and scale it back up to the right number at the end. I think we all agree this is just a painful workaround and not a solution.
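Just to illustrate how clumsy that workaround is, it would look something like this (with downtime in between):

kubectl scale --replicas=0 deploy/nginx-deployment
kubectl scale --replicas=3 deploy/nginx-deployment   # back to the desired count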
After a long time of researching, reading documentation, and experimenting, I felt quite desperate, you can believe it. At that point I read back the manifest as deployed by Kubernetes and found that Kubernetes had changed my original and configured a default rolling update strategy:
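  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

(These percentage-based values are the Kubernetes defaults that get filled in when a Deployment doesn't specify a strategy.)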
Because my main goal was to ensure that pods terminate one by one, it made sense to replace the percentages in the strategy with exact numbers.
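The idea, roughly (a sketch; the exact values live in the gist below, but the point is to let Kubernetes take down one old pod before it creates the replacement):

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1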
kubectl apply -f https://gist.githubusercontent.com/mhmxs/e07dcf97a68041de7e52f188356fdafa/raw/28a9e30a8ca333603ba995d32d45cdc93face432/deploymentwithPAAAndStrategy.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-56cc8569bd-cqp5t   1/1     Running   0          34s   10.32.0.3   kind-control-plane   <none>           <none>
nginx-deployment-56cc8569bd-s8jx2   1/1     Running   0          23s   10.40.0.3   kind-worker          <none>           <none>
nginx-deployment-56cc8569bd-wh4wv   1/1     Running   0          23m   10.38.0.1   kind-worker2         <none>           <none>
The result was surprising at first. The rolling update strategy now allows Kubernetes to terminate one running pod before starting the new one, so pod anti-affinity can't block the rollout anymore.
It seems I did it! Just a few tweaks and it should be ready for production :D. This workload is a critical one, so I gave it a priorityClassName and a bunch of tolerations. Pod priority defines how important a pod is (Kubernetes evicts less important pods first in case of resource pressure), and tolerations allow the pod to be scheduled onto unhealthy (tainted) nodes. (I also suggest implementing graceful termination based on your needs.)
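The relevant additions look something like this (a sketch; the priority class name and the tolerated taints are assumptions, pick whatever matches your cluster):

      priorityClassName: system-cluster-critical   # assumed; any existing high-priority class works
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute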
kubectl apply -f https://gist.githubusercontent.com/mhmxs/8a2a95bd9e0c86051c6ea125a9dc6c00/raw/3ef7e4da65025da204dba944461634d977e53ec2/deploymentwithPAAAndStrategyAndToleration.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP          NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-865777c646-2vpqr   1/1     Running   0          3m15s   10.38.0.1   kind-worker2         <none>           <none>
nginx-deployment-865777c646-5jc4m   1/1     Running   0          3m23s   10.32.0.3   kind-control-plane   <none>           <none>
nginx-deployment-865777c646-tm6cq   1/1     Running   0          3m23s   10.40.0.3   kind-worker          <none>           <none>
If you think this solution is not "Cloud Native" at all, you might be right. But keep in mind, I'm a storage developer and customer data comes first for me. With this solution we are able to monitor volumes on the cluster per node, and control when and how Kubernetes is allowed to drain nodes that hold data.