High availability, scalability, and capacity planning

Highly available systems must also be scalable. The load on most complex distributed systems varies dramatically based on the time of day, weekdays versus weekends, seasonal effects, marketing campaigns, and many other factors. Successful systems gain more users over time and accumulate more and more data, which means that the physical resources of the cluster (mostly nodes and storage) will have to grow over time too. If your cluster is under-provisioned, it will not be able to satisfy the demand, and it will effectively be unavailable because requests will time out or be queued up and not processed fast enough.

This is the realm of capacity planning. One simple approach is to over-provision your cluster. Anticipate the demand and make sure you have enough of a buffer for spikes of activity. But be aware that this approach suffers from several deficiencies:

  • For highly dynamic and complicated distributed systems, it's difficult to forecast the demand even approximately.
  • Over-provisioning is expensive. You spend a lot of money on resources that are rarely or never used.
  • You have to periodically redo the whole process because the average and peak load on the system change over time.

A much better approach is intent-based capacity planning, where you specify your requirements at a high level of abstraction and the system adjusts itself accordingly. In the context of Kubernetes, the horizontal pod autoscaler (HPA) can grow and shrink the number of pods needed to handle requests for a particular service. However, that only changes the ratio of resources allocated to different services; when the entire cluster approaches saturation, you simply need more resources. This is where the cluster autoscaler comes into play. It is a Kubernetes project that became available with Kubernetes 1.8, and it works particularly well in cloud environments where additional resources can be provisioned via programmatic APIs.
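
To make the HPA part concrete, here is a minimal sketch of an HPA manifest; the Deployment name webapp and the 50% CPU target are illustrative assumptions, not values from a real cluster:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: webapp
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: webapp      # illustrative target deployment
      minReplicas: 2      # never drop below two pods
      maxReplicas: 10     # cap growth, analogous to the CA's node range
      targetCPUUtilizationPercentage: 50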

When the cluster autoscaler (CA) determines that pods can't be scheduled (that is, they are in the pending state), it provisions a new node for the cluster. It can also remove nodes from the cluster if it determines that the cluster has more nodes than necessary to handle the load. The CA will check for pending pods every 30 seconds. It will remove nodes only after 10 minutes of not being used, to avoid thrashing.
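
Both intervals are configurable through the CA's command-line flags. As a hedged sketch (the flag names below are real cluster autoscaler options, but their defaults vary between versions), the container command could set them explicitly:

    command:
    - ./cluster-autoscaler
    - --scan-interval=30s              # how often to look for pending (unschedulable) pods
    - --scale-down-unneeded-time=10m   # how long a node must be unneeded before removal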

Here are some issues to consider:

  • A cluster may require more nodes even if total CPU or memory utilization is low, because of scheduling constraints such as affinity, anti-affinity, taints, tolerations, pod priorities, and pod disruption budgets (see the sketch after this list).
  • In addition to the built-in delays in triggering the scaling up or down of nodes, there is an additional delay of several minutes when provisioning a new node from the cloud provider.
  • The interactions between the HPA and the CA can be subtle.
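
As an example of the first point, a hard anti-affinity rule like the following (a sketch using a hypothetical app label) forces each replica onto a different node, so scaling the deployment up can demand new nodes even while existing nodes sit mostly idle:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web                        # hypothetical label
            topologyKey: kubernetes.io/hostname # at most one 'web' pod per node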

Installing the cluster autoscaler

Note that you can't test the CA locally. You must have a Kubernetes cluster running on one of the following supported cloud providers:

  • GCE
  • GKE
  • AWS EKS
  • Azure
  • Alibaba Cloud
  • Baidu Cloud

I have installed the CA successfully on GKE as well as AWS EKS.

The eks-cluster-autoscaler.yaml file contains all the Kubernetes resources needed to install the CA on EKS. It involves creating a service account and giving it various RBAC permissions, because the CA needs to monitor node usage across the cluster and act on it. Finally, there is a deployment that deploys the CA image itself, with a command line that includes the range of nodes (that is, the minimum and maximum number) it should maintain; in the case of EKS, a node group is needed too. The maximum number is important to prevent a situation where an attack or error causes the CA to add more and more nodes uncontrollably, racking up a huge bill. Here is a snippet from the pod template:

    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.2.2
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            # min:max:EKS node group name
            - --nodes=2:5:eksctl-project-nodegroup-ng-name-NodeGroup-suffix
          env:
            - name: AWS_REGION
              value: us-east-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"
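
The service account referenced in the pod template is defined elsewhere in the same file, together with the RBAC permissions mentioned above. Here is a hedged sketch of what that part might look like, assuming the CA is deployed to the kube-system namespace; the actual rules in eks-cluster-autoscaler.yaml are more extensive:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: cluster-autoscaler
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: cluster-autoscaler
    rules:
      # the CA must watch cluster state to decide when to add or remove nodes
      - apiGroups: [""]
        resources: ["nodes", "pods", "services", "events"]
        verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: cluster-autoscaler
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-autoscaler
    subjects:
      - kind: ServiceAccount
        name: cluster-autoscaler
        namespace: kube-system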

The combination of the HPA and the CA provides a truly elastic cluster, where the HPA ensures that each service uses the proper number of pods to handle its load, and the CA makes sure that the number of nodes matches the overall load on the cluster.

Considering the vertical pod autoscaler

The vertical pod autoscaler (VPA) is another autoscaler that operates on pods. Its job is to adjust the CPU and memory allocated to pods whose resource requests are set too low. It is designed primarily for stateful services, but can work for stateless services too. It is based on a CRD (custom resource definition) and has three components (a sample custom resource is sketched after the list):

  • Recommender: Watches CPU and memory usage and provides recommendations for new values for CPU and memory requests
  • Updater: Kills managed pods whose CPU and memory requests don't match the recommendations made by the recommender
  • Admission plugin: Sets the CPU and memory requests for new or recreated pods based on recommendations
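
Once the VPA components are installed, you opt a workload in with a VerticalPodAutoscaler custom resource. Here is a minimal sketch; the Deployment name backend is illustrative, and the exact apiVersion depends on the VPA release you install:

    apiVersion: autoscaling.k8s.io/v1beta2
    kind: VerticalPodAutoscaler
    metadata:
      name: backend-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: backend      # illustrative target workload
      updatePolicy:
        updateMode: Auto   # allow the updater to evict pods so new requests take effect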

The VPA is still in beta. Here are some of the main limitations:

  • Unable to update running pods (hence the updater kills pods to get them restarted with the correct requests)
  • Can't evict pods that aren't managed by a controller
  • The VPA can't be used together with the HPA on the same pods when both act on the same CPU and memory metrics

This section covered the interplay between autoscaling and high availability, and looked at different approaches for scaling Kubernetes clusters and the applications running on them.