Disrupting OpenShift — Part 1

Jul 6, 2019

Introduction

OpenShift is an amazing container platform built on Kubernetes that can be deployed on premises or in a public cloud environment. For more information, see https://www.openshift.com/

With OpenShift 4.1, deploying a cluster has become very straightforward. You simply give the installer access to your cloud environment and, voilà, the installer does everything: it creates the VMs, installs and configures the product, and makes it ready to use. Amazing!

With the use of Kubernetes Operators, OpenShift 4.1 goes one step further and makes configuration and scaling even more trivial: you simply define a Kubernetes custom resource (Machine), and OpenShift takes the corresponding action of creating the VM and configuring it as a node.

Now, in light of all this automation and resilience, the question that comes to my mind is, “what happens when I start disrupting OpenShift?”

So I decided to do a series of disruption tests to see how OpenShift would recover from a failure.

In this article, I describe the first test; the remaining tests will get their own articles (keeping it simple…)

Architecture

I deployed a typical OpenShift 4.1 cluster on AWS with the following configuration:

3 masters
3 workers

You can see the nodes in the following output:

Test 1: Destroying a master

The first test is to answer the question: “What happens if I destroy a master?”

There are a few ways to do it: destroying the VM, making it inaccessible from the other masters, etc. I decided to ride the Kubernetes custom resource wagon and delete the Machine custom resource.

So let’s find all the machines by running the following command:

oc get machines -n openshift-machine-api

Then I destroy a master by deleting its Machine in the same namespace:

oc delete machine <master-1> -n openshift-machine-api

So, what happened after I destroyed the master machine?

Nothing.

As you can see in the output below, OpenShift did not create a new master, so the cluster is operating with just 2 masters.

What happened

Well, I asked OpenShift to delete the machine, and it did.

Like a Kubernetes Pod, a Machine doesn’t have a recovery mechanism of its own. So, when it dies, it’s dead.

To make a Machine resilient, you need a MachineSet, which is to a Machine what a ReplicaSet is to a Pod.

So, let’s look at the MachineSets in the environment:

You see that there are MachineSets for the workers in the different AWS Availability Zones, but not for the master.

I guess the masters don’t need scalability (there are always 3), so OpenShift decided not to create a MachineSet for them.
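One way to check whether a given Machine is managed by a MachineSet is to look at its ownerReferences metadata, which Kubernetes uses to link an object to the controller that owns it. The sketch below assumes that Machines created from a MachineSet carry such a reference and that the master Machines created at install time do not. The machine_owner function is a hypothetical helper, so treat this as an illustration rather than a verified procedure.

```shell
#!/bin/sh
# Print the kind(s) of object that own a Machine, or "none" if it stands alone.
# machine_owner is a hypothetical helper, not an oc subcommand.
machine_owner() {
  # jsonpath prints nothing when the Machine has no ownerReferences.
  owner=$(oc get machine "$1" -n openshift-machine-api \
    -o jsonpath='{.metadata.ownerReferences[*].kind}')
  echo "${owner:-none}"
}
```

Under those assumptions, calling machine_owner with the name of a master Machine should print none, while a worker Machine should report MachineSet.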

Conclusion

In this first experiment, we saw that when we delete the Machine associated with a master, OpenShift doesn’t recreate it.

Well, life is not perfect. The cluster remains operational, but in a risky state. If we lose another master (for any reason), etcd will lose quorum and stop working, and consequently so will OpenShift.
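The risk can be made concrete with a little arithmetic. etcd needs a majority of its configured members (a quorum) to keep accepting writes, so what matters is how many members are still alive compared with that majority. The sketch below only illustrates the rule; the function names are mine and are not part of etcd or OpenShift.

```shell
#!/bin/sh
# Majority quorum for a cluster of N members: floor(N / 2) + 1.
# Function names are illustrative, not part of etcd or OpenShift.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of further member failures the cluster can absorb.
# Quorum is computed from the configured size; alive is how many are still up.
failures_left() {
  size=$1
  alive=$2
  echo $(( alive - $(quorum "$size") ))
}

echo "quorum for 3 masters: $(quorum 3)"
echo "failures tolerated with 3 of 3 alive: $(failures_left 3 3)"
echo "failures tolerated with 2 of 3 alive: $(failures_left 3 2)"
```

With 3 masters the quorum is 2, so a healthy cluster can lose one master. After the deletion in this test only 2 of the 3 are alive, which leaves no margin at all.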

In the next article, I will describe how to recover from this situation.

Stay tuned!
