Six things to consider related to OpenShift resilience
(This article is written in collaboration with Ernese Norelus)
Recently, a co-worker asked for my point of view on OpenShift resilience.
There are undoubtedly many aspects and misconceptions related to resilience in OpenShift (or any Kubernetes environment), and here are some off the top of my head.
1. Container resilience is not enough. You also need cluster resilience
Kubernetes (and hence OpenShift) provides a lot of resilience for containers. If a container that is part of a Deployment, StatefulSet, ReplicaSet, etc. dies, Kubernetes will restart it.
This characteristic provides resilience for the container (or Pod, in Kubernetes terms), but it might not be enough.
The resilience is provided within the boundary of a cluster. What happens if the cluster dies or is unavailable? What if the cluster runs out of capacity?
The solution is to consider deploying multiple clusters for high availability. This way, if one cluster dies, your solution remains available in another cluster. Preferably, each cluster should be deployed to a different data center or region, which relates to my second recommendation.
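As a minimal sketch of the container-level resilience mentioned above (the application name and image are hypothetical), a Deployment with several replicas and a liveness probe lets Kubernetes restart or replace unhealthy containers automatically:

```yaml
# Sketch: three replicas plus a liveness probe; Kubernetes restarts any
# container whose probe fails, but only within this one cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: quay.io/example/my-app:1.0   # hypothetical image
          livenessProbe:
            httpGet:
              path: /healthz                  # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
```

Note that everything in this manifest lives inside a single cluster boundary, which is exactly why cluster-level resilience is a separate concern.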
2. Share nothing across your cluster
The clusters should be completely independent, without sharing any resources.
You might be tempted to share the storage layer or some service outside the clusters. Unless those services provide their own high-availability solution, a cluster should not share any service with another cluster. This way, we avoid creating a single point of failure.
3. Consider many small, simple clusters
I have been working with Kubernetes for four years, which is not as long as my employer (IBM) asked for in this job description: https://intellijobs.ai/job/IBMCloud-Native-Infrastructure-Engineer-Architect-bvJJ6yraexfWOk1nMRKP-bvJJ6yraexfWOk1nMRKP.
Years ago, we built some exotic cluster topologies, with clusters spanning data centers, many worker nodes, and so on. Often, a failure in a network device or in the storage layer compromised cluster availability.
So, my suggestion is to deploy many simple clusters. This way, you can scale, upgrade, and even destroy a cluster without having to move a mountain.
Deploying a cluster across many availability zones is a recommended approach, but a cluster should not span regions or cloud providers.
Combining this recommendation with the previous one will give us a lot of small and independent clusters.
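Spreading a single cluster's workloads across its availability zones, as recommended above, can be sketched with a `topologySpreadConstraints` stanza (the labels and image are hypothetical):

```yaml
# Sketch: spread six replicas evenly across the zones of one cluster.
# The cluster itself still stays within a single region.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                  # at most 1 replica imbalance
          topologyKey: topology.kubernetes.io/zone    # spread by zone label
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: quay.io/example/my-app:1.0           # hypothetical image
```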
4. Have a Continuous Delivery process to deploy the resources
Containers bring a new way of developing applications: small components (you can call them microservices) that are developed and deployed independently.
Along with the containers, OpenShift requires the creation of many Kubernetes resources as part of an application: Deployments, PVCs, Services, etc.
With the proper configuration of OpenShift projects, quotas, and Security Context Constraints, you might be tempted to grant users direct access to the OpenShift environment so they can create these resources themselves.
Such an approach brings a series of problems related to application resilience:
- How do you redeploy an application?
- How do you back up the application resources?
- How do you trace back an application deployment?
A better approach is to limit what users can do directly in an OpenShift cluster. Ideally, in a production environment, only a Continuous Delivery (CD) process should have the authority to deploy resources.
This recommendation provides resilience because it enforces consistent application deployments. Whether the application is redeployed to the same cluster or moved to a new one, a CD process guarantees consistency.
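One way to enforce this is through RBAC: grant write access in production only to the CD pipeline's service account, and give everyone else read-only access. A sketch, with hypothetical namespace, service account, and group names:

```yaml
# Sketch: only the CD pipeline's service account can create/update resources
# in the production namespace; developers may only view them.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cd-pipeline-edit
  namespace: my-app-prod          # hypothetical production namespace
subjects:
  - kind: ServiceAccount
    name: cd-pipeline             # hypothetical CD service account
    namespace: cicd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-view
  namespace: my-app-prod
subjects:
  - kind: Group
    name: developers              # hypothetical user group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
```

Because the CD process is the only writer, the same pipeline can recreate the application identically on any cluster.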
5. Deploy your application in an Active-Active mode
An OpenShift cluster is just a cluster. It sits somewhere eager to run Kubernetes applications. Regarding resilience, what matters is how these applications are deployed across the clusters.
Setting aside the technicalities of terms like HA, DR, RTO, RPO, and five 9s, I recommend deploying your application in an active-active mode. This means the application is active in two (or more) clusters, with a load balancer on top.
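The "load balancer on top" can be pictured with a sketch like this HAProxy configuration fragment (the backend hostnames are hypothetical), which health-checks each cluster and balances traffic across both:

```
# Sketch: round-robin across two OpenShift clusters; a cluster that fails
# its health check is taken out of rotation automatically.
frontend app_front
    bind *:80
    default_backend app_clusters

backend app_clusters
    balance roundrobin
    option httpchk GET /healthz
    server cluster-a apps.cluster-a.example.com:80 check
    server cluster-b apps.cluster-b.example.com:80 check
```

With both backends active, losing one cluster degrades capacity but does not require any failover action.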
The other way would be an active-passive configuration, which always requires some action (whether automated or manual) to activate the passive side.
Implementing an active-active stateless application is a piece of cake: just deploy the application across the clusters and be happy! However, we increasingly see stateful applications in OpenShift, which raises the question of how to share or propagate the data.
Certainly, we don’t want two (or many) segregated deployments, so we need to find a way to have a common data layer.
You might be tempted to use a data layer outside the cluster, exploiting the capabilities of your cloud provider. This is a valid approach, but beware of locking yourself into a cloud provider!
So, this leads to my last recommendation below.
6. Design your application considering many clusters upfront
OpenShift makes it simple to create a container-based application, then scale it, using many replicas and autoscaling.
However, many applications hit a wall when moving from a single cluster to many.
If we stick to the share-nothing principle, we need a solution for stateful applications deployed to many clusters (stateless applications are a no-brainer).
Take, for example, an application that uses Redis to store some data. Redis works well within a cluster, but I haven’t seen a simple way to deploy a cluster of Redis instances across OpenShift clusters.
So, my recommendation is that you design your application upfront with the notion that it will be deployed across multiple clusters, using components that work well spread across them.
One example is MongoDB, which provides a way to create a cluster of instances across Kubernetes clusters (as I described at https://medium.com/ibm-garage/how-to-implement-mongodb-replication-across-multiple-openshift-clusters-81c216967809).
OpenShift is an exciting platform, and deploying applications to it is even more exciting.
However, as I have seen over the last few years, you need to design the application for the reality of many clusters, instead of simply assuming that if "it runs in one cluster, it will run in many."
Bring your plan to the IBM Garage.
IBM Garage is built for moving faster, working smarter, and innovating in a way that lets you disrupt disruption.
Learn more at www.ibm.com/garage