Important operational aspects in a Kubernetes deployment
I use Kubernetes to deploy most of my applications. Writing a deployment seems simple at first sight: take a Docker image, set some environment variables, give the expected number of replicas, and apply the manifest. There are however some items that greatly improve reliability, especially while the application is upgraded:
Resources
An aspect that is often mentioned is resources. Pods should declare the memory and CPU they require so that Kubernetes can make the best scheduling decisions.
For that there are two things to configure:
- requests: the minimum amount of the resource that the pod needs to operate
- limits: the maximum amount of the resource that the pod is allowed to use
Limits are enforced by Kubernetes: the Linux kernel throttles a container to its CPU limit and kills it if it tries to use more memory than its memory limit. It is thus good to read /sys/fs/cgroup/memory/memory.limit_in_bytes (on cgroup v1; cgroup v2 exposes /sys/fs/cgroup/memory.max instead), which contains the actual limit, or to use a runtime that reads it automatically, like Java with UseContainerSupport.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: myapp
        resources:
          requests:
            memory: "100Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "2000m"
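For a JVM application, for instance, the limit can also be surfaced to the runtime so that the heap is sized from the container rather than from the node. A small sketch (UseContainerSupport is enabled by default on recent JDKs, and MaxRAMPercentage is only one of several ways to bound the heap):

      containers:
      - name: myapp
        env:
        # Sketch: let the JVM size its heap from the cgroup memory limit.
        # UseContainerSupport is on by default in recent JDKs; MaxRAMPercentage
        # caps the heap as a fraction of the container limit.
        - name: JAVA_TOOL_OPTIONS
          value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"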
Topology spread
Something less often documented is topology spread constraints. Replicating pods is great for reliability, but if all the replicas end up on the same physical server they remain at the mercy of a single crash. Kubernetes can instead be instructed to distribute the pods evenly across its nodes and availability zones.
For that it uses labels put on nodes, which are automatically set by cloud providers. I set up two constraints, on availability zone and on node name, to make sure that pods are evenly distributed across zones and that, even inside a zone, they are spread over different nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp
Since the constraints are expressed with pod labels, during a rollout new pods will tend to be created on nodes that were not running the previous application version (because pods of the previous version carry the same labels).
Static assets
For webapps I don’t use CDNs but instead serve statics from the application server, because it avoids having yet another service and it keeps dev and prod storage more similar. This means new pods might have to serve statics for old versions of the app, or old pods might have to serve statics for new versions of the app. This happens during the rolling upgrade, or shortly after it, while users navigate the app without refreshing.
The solution I use is a ReadWriteMany persistent volume (cloud providers offer that) shared by all pods, in which I store the statics of all application versions: a sort of cheap in-house CDN. At startup, the app copies its statics into this volume before starting the web server.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: xxx
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      initContainers:
      - name: copy-statics
        image: myapp:1.2.3
        command: ['cp', '-rv', '/srv/static', '/srv/shared']
        # The init container needs the shared volume mounted too,
        # otherwise the copy would not reach the persistent volume.
        volumeMounts:
        - name: shared
          mountPath: /srv/shared
          subPath: public
      containers:
      - name: myapp
        image: myapp:1.2.3
        volumeMounts:
        - name: shared
          mountPath: /srv/shared
          subPath: public
        env:
        - name: STATICS_FOLDER
          value: /srv/shared
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: myapp-pvc
The volume grows on every deploy; at some point I will have cleanup to perform, but right now it grows slowly enough that it is fine.
Note that I pass the -v flag to cp because I want my logs to show which files were copied on a given deploy.
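When that cleanup becomes necessary, one option could be a CronJob that periodically prunes files that have not been copied for a while. The manifest below is only a hypothetical sketch (the name, schedule and 90-day retention are assumptions); it relies on the fact that cp without -p gives copied files a fresh modification time on every deploy.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: myapp-statics-cleanup
spec:
  schedule: "0 3 * * 0"  # weekly, Sunday at 03:00 (arbitrary)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: cleanup
            image: busybox:1.36
            # Delete files that have not been (re)copied for 90 days.
            command: ["find", "/srv/shared", "-type", "f", "-mtime", "+90", "-delete"]
            volumeMounts:
            - name: shared
              mountPath: /srv/shared
              subPath: public
          volumes:
          - name: shared
            persistentVolumeClaim:
              claimName: myapp-pvc

The retention period just needs to stay longer than the time between two deploys, so that statics of versions still being rolled out are never pruned.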
Probes
Now that the app runs, another aspect that is frequently mentioned is probes. These are commands or HTTP endpoints that the kubelet periodically checks to see whether the application is actually running. There are three different types of probes.
All deployments should have a liveness probe. This one is used by Kubernetes to decide when to restart the pod. Restarting is the first measure tried when something goes wrong: it moves the application out of an indeterminate state back to its initial state, which is hopefully better.
Deployments exposed via a service should have a readiness probe. This one is used by Kubernetes to decide whether to include the pod in the service endpoints. A pod passing readiness will be used to serve requests, while one failing will be excluded. The readiness probe can run at a higher frequency than liveness, because we want to remove misbehaving pods from the endpoints quickly while still giving them a chance to recover without a restart.
Pods which are slow to start should have a startup probe. This one is executed by Kubernetes before the two other types kick in. I use something identical to the liveness probe, except with a much larger tolerance to failures: it will fail for some time while the pod boots, and once it succeeds it is no longer checked because liveness takes over.
The processing behind the readiness probe should check that the application is actually working, not just reply “ok”. For a web application, merely replying “ok” only proves that the HTTP layer works; if the endpoint also pings the database, it additionally proves that connection parameters and network are correctly configured. At the same time, this endpoint must not consume too many resources, because it ends up being invoked quite often.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: myapp
        startupProbe:
          httpGet:
            path: /health_check
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /health_check
            port: 8080
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health_check
            port: 8080
          periodSeconds: 5
Shutdown
Finally, when the application shuts down, it must not cause failures by quitting too early. This frequently happens with web applications.
HTTP frameworks are designed to let the application finish serving in-flight requests when they receive the signal to stop. But Kubernetes is a distributed system, and some nodes might still route requests to a stopping pod because they have not yet been notified that it is shutting down. The idea is thus to give Kubernetes some time to propagate the shutdown information (by removing the pod from service endpoints) before the application actually stops.
This can be done with a preStop lifecycle hook:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 20
      containers:
      - name: myapp
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]
With that, when shutdown starts, Kubernetes removes the pod from the endpoints and runs the command. The command does nothing for 10 seconds, during which the pod still serves requests coming from nodes not yet aware of the removal. Only after the command completes does Kubernetes send the termination signal and let the application shut down for real.
terminationGracePeriodSeconds must be large enough for both the preStop hook and the normal shutdown to fit; otherwise the pod is killed without being given a chance to shut down gracefully (like committing some ongoing work).
If the image that runs the application does not have a shell, a small sidecar container based on busybox can be helpful without consuming too many resources.
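Alternatively, recent Kubernetes versions ship a built-in sleep action for lifecycle hooks, which removes the need for a shell entirely. A sketch, assuming the cluster is recent enough to support it:

        lifecycle:
          preStop:
            # Same 10 second pause as above, without invoking a shell.
            sleep:
              seconds: 10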
The timeout is to be adjusted depending on what relies on it: if an AWS load balancer is directed to the service via an Ingress, its deregistration delay must be smaller than the preStop duration.
Conclusion
Here are the 5 items that I consider necessary for an application to be reliable on Kubernetes. They are mostly targeted at upgrades, because this is a crucial moment to ensure that active users do not see any interruption of service. There are obviously many other aspects to consider on the application itself (upgrading the database, HTTP backward compatibility, …)!