I use Kubernetes to deploy most of my applications. Writing a deployment seems simple at first sight: take a Docker image, set some environment variables, give the expected number of replicas, and apply the manifest. There are however a few items that greatly improve reliability, especially while the application is being upgraded:

Resources

An aspect that is often mentioned is resources. Pods should declare the memory and CPU they require so that Kubernetes can make the best scheduling decisions.

For that there are two things to configure:

  • requests: the minimal amount of the resource that the pod needs to operate
  • limits: the maximum amount of the resource that we allow the pod to use

Limits are enforced by Kubernetes. A container is throttled by the Linux kernel to its CPU limit, and it is killed if it tries to use more memory than its limit. It is thus good to read the actual limit from /sys/fs/cgroup/memory/memory.limit_in_bytes (or /sys/fs/cgroup/memory.max with cgroup v2), or to use a runtime that reads this automatically, like Java with -XX:+UseContainerSupport.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: myapp
        resources:
          requests:
            memory: "100Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "2000m"

Topology spread

Something less often documented is topology spread constraints. Replicating pods is great for reliability, but if all those replicas end up on the same physical server they remain at the mercy of a single crash. Kubernetes can instead be instructed to distribute pods evenly across the cluster.

For that it uses labels put on nodes, which cloud providers set automatically. I set up two constraints, on availability zone and on node name, to make sure that pods are evenly distributed across zones and that even inside a zone they are spread over different nodes.
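
For reference, this is roughly what those well-known labels look like on a node; the values here are illustrative:

# Typical labels set automatically by a cloud provider on a node:
labels:
  topology.kubernetes.io/zone: eu-west-1a
  kubernetes.io/hostname: ip-10-0-1-23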

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: myapp

Since the constraints are expressed with pod labels, during a rollout new pods will tend to be created on nodes that were not running the previous version of the app (because the previous version’s pods carry the same labels).

Static assets

For webapps I don’t use CDNs but instead serve static assets from the application server, because it avoids having yet another service and it keeps dev and prod storage more similar. This means new pods might have to serve statics for old versions of the app, or old pods might have to serve statics for new versions. It happens during the rolling upgrade, or shortly after it while users navigate the app without refreshing.

The solution I use is a ReadWriteMany persistent volume (cloud providers offer that) shared by all pods, in which I store the statics of all application versions, sort of a cheap in-house CDN. At startup the app copies its statics into this folder before starting the web server.

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: xxx
  resources:
    requests:
      storage: 5Gi

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      initContainers:
      - name: copy-statics
        image: myapp:1.2.3
        command: ['cp', '-rv', '/srv/static', '/srv/shared']
        # Mount the shared volume so the copy actually lands on the PVC.
        volumeMounts:
        - name: shared
          mountPath: /srv/shared
          subPath: public
      containers:
      - name: myapp
        image: myapp:1.2.3
        volumeMounts:
        - name: shared
          mountPath: /srv/shared
          subPath: public
        env:
        - name: STATICS_FOLDER
          value: /srv/shared
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: myapp-pvc

The volume grows on every deploy; at some point I will have some cleanup to perform, but right now it grows slowly enough that it is fine. Note that in my cp command I use the -v flag because I want my logs to show which files were modified on a given deploy.
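
When that time comes, one option could be a CronJob pruning files that no deploy has touched for a while; a hypothetical sketch (the schedule and the 90-day retention are made up):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: myapp-statics-cleanup
spec:
  schedule: "0 4 * * 0" # weekly, Sunday at 4am
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup
            image: busybox
            # Delete static files untouched by the last 90 days of deploys.
            command: ['sh', '-c', 'find /srv/shared -type f -mtime +90 -delete']
            volumeMounts:
            - name: shared
              mountPath: /srv/shared
              subPath: public
          volumes:
          - name: shared
            persistentVolumeClaim:
              claimName: myapp-pvc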

Probes

Now that the app runs, another aspect that is frequently mentioned is probes. These are commands or HTTP endpoints that the kubelet will periodically check to see if the application is actually running. There are three different types of probes.

All deployments should have a liveness probe. It is used by Kubernetes to decide when to restart the container. Restarting is the first measure tried when something goes wrong: it moves the application out of an indeterminate state back to its initial state, which is hopefully better.

Deployments exposed via a service should have a readiness probe. It is used by Kubernetes to decide whether to include the pod in the service endpoints. A pod passing readiness will be used to serve requests, while one failing it will be excluded. The readiness probe can run at a higher frequency than liveness: we want to remove misbehaving pods from the endpoints quite fast, while still giving them a chance to recover without a restart.

Pods which are slow to start should have a startup probe. It is executed by Kubernetes before the two other types kick in. I use something identical to the liveness probe, except with a much larger tolerance to failures: it will fail for some time while the pod boots, and once it succeeds it is no longer checked because the other probes take over.

The processing behind readiness should check that the application actually works, not just reply “ok”. For a web application, merely replying “ok” only proves that the HTTP layer works. If the application additionally issues a ping to the database, that also proves that the connection parameters and the network are correctly configured! At the same time this endpoint must not consume too many resources, because it ends up being invoked quite often.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: myapp
        startupProbe:
          httpGet:
            path: /health_check
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /health_check
            port: 8080
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health_check
            port: 8080
          periodSeconds: 5

Shutdown

Finally, when the application shuts down, it must not cause failures by quitting too early. This frequently happens with web applications.

HTTP frameworks are designed to let the application finish serving in-flight requests when they receive the signal to stop. But Kubernetes is a distributed system, and some node might still route requests to a stopping pod because it has not yet been notified that the pod was shutting down. The idea is thus to give Kubernetes some time to propagate the pod shutdown information (by removing the pod from the service endpoints).

This can be done with a preStop lifecycle hook:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 20
      containers:
      - name: myapp
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]

With that, when shutdown starts, Kubernetes removes the pod from the endpoints and runs the preStop command. The command does nothing for 10 seconds, during which the pod still serves requests coming from nodes not yet aware of the removal. Only once the command completes does Kubernetes send the termination signal, and the application can shut down gracefully.

terminationGracePeriodSeconds must be large enough for both the preStop hook and the normal shutdown to fit, otherwise the pod is terminated without being given a chance to shut down gracefully (like committing some ongoing work). If the image that runs the application does not have a shell, a small busybox sidecar container can be helpful without consuming too many resources.
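
Alternatively, recent Kubernetes versions (1.30 and later, where the PodLifecycleSleepAction feature is enabled by default) offer a built-in sleep action that needs no shell in the image at all; a sketch, assuming such a cluster:

        lifecycle:
          preStop:
            # Built-in sleep action, no shell required in the image.
            sleep:
              seconds: 10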

The sleep duration must be adjusted depending on what relies on it. If an AWS load balancer is pointed at the service via an ingress, its deregistration delay must be smaller than the preStop duration.
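
With the AWS Load Balancer Controller, for instance, that delay can be set through an annotation on the ingress; a sketch where the 5 second value is illustrative and stays below the 10 second preStop sleep:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    # Keep the deregistration delay below the preStop sleep.
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=5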

Conclusion

Here are the 5 items that I consider necessary for an application to be reliable on Kubernetes. They are mostly targeted at upgrades, because this is a crucial moment to ensure that active users do not see any interruption of service. There are obviously many other aspects to consider on the application itself (upgrading the database, HTTP service backward compatibility, …)!