During the development of one of our new features, I faced an interesting challenge. The feature request is simple and clear: there is an existing DaemonSet (a workload running on "every" node) on the target Kubernetes cluster, we have to deploy another workload next to each of its instances, and we have to prevent workload termination under certain conditions. Let's split the problem into two parts: deployment and prevention.
From the deployment perspective, another DaemonSet makes a lot of sense. If we use the same node selectors as the existing one, Kubernetes will deploy its pods to the same nodes. In our case a custom operator is working in the background, so we are able to keep the node selectors in sync, but for other kinds of deployments this can be a tricky piece.
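As a minimal sketch, the companion DaemonSet simply mirrors the target's nodeSelector (the storage: enabled label is a hypothetical example; in practice the operator copies whatever the target DaemonSet uses):

  # companion DaemonSet pod template (fragment)
  spec:
    nodeSelector:
      storage: enabled   # hypothetical label, copied from the target DaemonSet's spec.template.spec.nodeSelector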
On the topic of prevention, PodDisruptionBudget [PDB] comes into the picture. Without going into details, a PDB lets us define how many of the target pods Kubernetes is allowed to disrupt at once. It has a maxUnavailable field, and if we set it to 1, Kubernetes is only able to drain/upgrade/downscale nodes one by one. That sounds like a solution, but we aren't lucky enough this time, because maxUnavailable doesn't support DaemonSets, only a few other higher-level abstractions. A DaemonSet counts as an arbitrary controller, so only minAvailable is supported, with some further limitations. While maxUnavailable could stay a static number, minAvailable needs to follow the number of replicas to ensure only 1 pod is terminating at a time. As I have mentioned, we already have an operator which can update the value based on the number of DaemonSet pods.
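For example, with 3 DaemonSet pods the operator would keep minAvailable at 2, so at most one pod can be evicted at a time. A minimal sketch (the name and label are hypothetical):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: companion-pdb          # hypothetical name
spec:
  minAvailable: 2              # kept at (number of DaemonSet pods - 1) by the operator
  selector:
    matchLabels:
      app: companion           # hypothetical label of the companion pods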
In summary:
- Deploy a DaemonSet with the same node selectors as the target DaemonSet
- Actively monitor and update node selectors to ensure pod locality
- Deploy a PDB with the right number of minAvailable
- Update minAvailable each time the number of pods changes
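To experiment with the scheduling part, let's reproduce it with a plain nginx Deployment on a 3-node kind cluster. My first attempt relied on a Pod Topology Spread Constraint (PTSC) to keep at most one pod per node; the relevant fragment of the manifest applied below looks roughly like this (a sketch, assuming the spread is defined over the node hostname with maxSkew 1):

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx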
kubectl apply -f https://gist.githubusercontent.com/mhmxs/d18ef95bc61d4c653b85aa3235d7d3e6/raw/9c0bad9ad9549be98d7136056ddcd14dd25b0479/deploymentwithPTSC.yaml
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE    IP          NODE
nginx-deployment-65c9b96899-4nj8s   1/1     Running   0          102s   10.38.0.2   kind-worker2
nginx-deployment-65c9b96899-d9kgv   1/1     Running   0          102s   10.40.0.3   kind-worker
nginx-deployment-65c9b96899-lr45h   1/1     Running   0          102s   10.32.0.3   kind-control-plane
At first sight it did the trick, but what happens if we increase the number of replicas?
kubectl scale --replicas=4 deploy/nginx-deployment
deployment.apps/nginx-deployment scaled
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE
nginx-deployment-65c9b96899-4vbjj   1/1     Running   0          49s   10.38.0.1   kind-worker2
nginx-deployment-65c9b96899-7zrbz   1/1     Running   0          87s   10.38.0.2   kind-worker2
nginx-deployment-65c9b96899-bwdwr   1/1     Running   0          89s   10.32.0.3   kind-control-plane
nginx-deployment-65c9b96899-wvd45   1/1     Running   0          91s   10.40.0.3   kind-worker
It seems DoNotSchedule wasn't strong enough to block the placement: a topology spread constraint only limits the skew between nodes, it doesn't cap the number of pods per node, so the fourth pod still gets scheduled. What else should we do?
In Kubernetes there is another concept called Pod Affinity/Anti-affinity. For details please follow the documentation, it is a huge topic, but for now I would like to use pod anti-affinity of the pod against itself during scheduling. In short, a pod of the Deployment refuses to be scheduled onto a node that already runs another pod with the same labels. At this point we can optionally remove the PTSC from the Deployment.
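The relevant fragment is roughly the following (a sketch; the app: nginx label comes from the Deployment, while the hostname topology key is an assumption matching the one-pod-per-node goal):

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nginx
            topologyKey: kubernetes.io/hostname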
kubectl apply -f https://gist.githubusercontent.com/mhmxs/85ba6fa37df21f26fb30991b3d769005/raw/b9f26c6764bce2cd10bc33d4fd87474d66de203c/deploymentWithPAA.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE
nginx-deployment-56cc8569bd-wh4wv   0/1     Pending   0          24s   <none>      <none>
nginx-deployment-65c9b96899-7zrbz   1/1     Running   0          14m   10.38.0.2   kind-worker2
nginx-deployment-65c9b96899-bwdwr   1/1     Running   0          14m   10.32.0.3   kind-control-plane
nginx-deployment-65c9b96899-wvd45   1/1     Running   0          14m   10.40.0.3   kind-worker
Oops, there is an error! Let's describe the pending pod.
kubectl describe po nginx-deployment-56cc8569bd-wh4wv
Name:         nginx-deployment-56cc8569bd-wh4wv
Namespace:    default
Priority:     0
Node:         <none>
Labels:       app=nginx
              pod-template-hash=56cc8569bd
Annotations:  <none>
Status:       Pending
Conditions:
  Type           Status
  PodScheduled   False
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  57s (x4 over 2m28s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't match pod anti-affinity rules.
The reason is simple: the new version of the pod doesn't tolerate the old one, that's why it is pending. To solve this problem we would have to scale the Deployment down to 0, and scale it back up to the right number at the end. I think we all agree this is just a painful workaround and not a solution.
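Just to illustrate how clumsy that workaround is, it would look something like this (with downtime in between):

kubectl scale --replicas=0 deploy/nginx-deployment
kubectl scale --replicas=3 deploy/nginx-deployment   # back to the desired count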
After a long time of researching, reading documentation, and experimenting, I felt quite desperate, you can believe it. At that point I read back the manifest as deployed by Kubernetes and found that Kubernetes had changed my original and configured a default rolling update strategy:
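  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%

(These percentage-based values are the Kubernetes defaults that get filled in when a Deployment doesn't specify a strategy.)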
Because my main goal was to ensure that pods terminate one by one, it made sense to replace the percentages in the strategy with exact numbers.
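The idea, roughly (a sketch; the exact values live in the gist below, but the point is to let Kubernetes take down one old pod before it creates the replacement):

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1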
kubectl apply -f https://gist.githubusercontent.com/mhmxs/e07dcf97a68041de7e52f188356fdafa/raw/28a9e30a8ca333603ba995d32d45cdc93face432/deploymentwithPAAAndStrategy.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE   IP          NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-56cc8569bd-cqp5t   1/1     Running   0          34s   10.32.0.3   kind-control-plane   <none>           <none>
nginx-deployment-56cc8569bd-s8jx2   1/1     Running   0          23s   10.40.0.3   kind-worker          <none>           <none>
nginx-deployment-56cc8569bd-wh4wv   1/1     Running   0          23m   10.38.0.1   kind-worker2         <none>           <none>
The result was surprising at first. The rolling update strategy now allows Kubernetes to terminate one running pod before starting the new one, so pod anti-affinity can't block the rollout anymore.
It seems I did it! Just a few tweaks and it should be ready for production :D. This workload is a critical one, so I gave it a priorityClassName and a bunch of tolerations. Pod priority defines how important a pod is (Kubernetes evicts less important pods first in case of resource pressure), and tolerations allow the pod to be scheduled onto unhealthy (tainted) nodes. (I also suggest implementing graceful termination based on your needs.)
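The relevant additions look something like this (a sketch; the priority class name and the tolerated taints are assumptions, pick whatever matches your cluster):

      priorityClassName: system-cluster-critical   # assumed; any existing high-priority class works
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute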
kubectl apply -f https://gist.githubusercontent.com/mhmxs/8a2a95bd9e0c86051c6ea125a9dc6c00/raw/3ef7e4da65025da204dba944461634d977e53ec2/deploymentwithPAAAndStrategyAndToleration.yaml
deployment.apps/nginx-deployment configured
kubectl get po -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP          NODE                 NOMINATED NODE   READINESS GATES
nginx-deployment-865777c646-2vpqr   1/1     Running   0          3m15s   10.38.0.1   kind-worker2         <none>           <none>
nginx-deployment-865777c646-5jc4m   1/1     Running   0          3m23s   10.32.0.3   kind-control-plane   <none>           <none>
nginx-deployment-865777c646-tm6cq   1/1     Running   0          3m23s   10.40.0.3   kind-worker          <none>           <none>
If you think this solution is not "Cloud Native" at all, you might be right. But keep in mind, I'm a storage developer and customer data comes first for me. With this solution we are able to monitor volumes on the cluster per node, and control when and how Kubernetes is allowed to drain nodes that hold data.