Validating resilience of a simple Kubernetes application
Introduction
Kubernetes provides a great deal of resilience for container-based applications.
If a Pod dies and is managed by a ReplicaSet, DaemonSet, or StatefulSet, Kubernetes replaces it without any user interaction.
But the question is: is this feature enough to make an application highly available?
In this blog, I will explore a very simple application (guestbook, described at https://kubernetes.io/docs/tutorials/stateless-application/guestbook/) and see if it provides application resilience.
Guestbook architecture
The guestbook has a very simple architecture, with the following components:
- The frontend, written in PHP and JavaScript, which uses the Redis master for writes and the Redis slaves for queries.
- A single Redis master, which handles all writes.
- One or more Redis slaves, which replicate the master and serve read queries.
The architecture seems pretty solid, as there can be many instances of the frontend and the Redis slaves. The only concern is the single instance of the Redis master.
But before we look at the single point of failure for the Redis master, let’s look at a design problem first.
Concurrency problem
The guestbook application has one flaw: the JavaScript code keeps track of the messages in the browser and simply appends each new message to that locally fetched list.
So if another user opens a browser and inserts a message, and the current user then inserts a new message, the other user’s message will be lost:
- Let’s call the users John and Mary and assume the application has no messages.
- John opens his browser and points to the guestbook application.
- Mary does the same.
- John inserts the message “Hello from John”.
- Mary inserts the message “Hello from Mary”.
- Mary’s browser only knows about her own message, so when it writes the list back, it overwrites John’s message.
A new version of the application is available at https://github.com/patrocinio/guestbook/tree/concurrency.
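To make the flaw concrete, the original client-side logic looks roughly like the sketch below. The names and the endpoint shape are illustrative, not copied from the repository:

// Flawed pattern (sketch): the browser keeps its own copy of the list and
// writes the whole list back, so it can overwrite messages added by other users.
let fetchedMessages = [];              // populated once, when the page loads

function saveMessages(list) {
  // Overwrites the server-side list with this browser's copy (endpoint shape assumed).
  return fetch('guestbook.php?cmd=set&key=messages&value=' + encodeURIComponent(list));
}

function submitMessage(newMessage) {
  fetchedMessages.push(newMessage);    // append to the locally cached list only
  saveMessages(fetchedMessages.join(','));
}

One natural fix, and presumably what the concurrency branch does, is to send only the new message and let the server side append it, so one browser can no longer clobber another browser’s writes.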
Testing the application resilience
Now that the application seems stable, let’s test its resilience.
To do that, I created a Node.js application that does the following:
- Clears all the messages from Redis
- Sends 100 messages, one at a time
- Ensures that there are 100 messages at the end
The code can be retrieved from https://github.com/patrocinio/guestbook-test.
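As a rough sketch of that sequential test (the endpoint names and payload are assumptions for illustration; the real code is in the repository above):

// Sequential test sketch (Node 18+, using the built-in fetch).
// GUESTBOOK_URL and the /clear, /message, /messages endpoints are hypothetical.
const GUESTBOOK_URL = 'http://guestbook.example.com';

async function run() {
  // Clear all the messages from Redis.
  await fetch(`${GUESTBOOK_URL}/clear`, { method: 'POST' });

  // Send 100 messages, one at a time, waiting for each response.
  for (let i = 0; i < 100; i++) {
    await fetch(`${GUESTBOOK_URL}/message`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message: String(i) }),
    });
  }

  // Ensure that there are 100 messages at the end.
  const messages = await (await fetch(`${GUESTBOOK_URL}/messages`)).json();
  console.log(messages.length === 100 ? 'PASS' : `FAIL: ${messages.length} messages`);
}

run().catch(console.error);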
The results were as expected: we had the 100 messages correctly persisted in the database.
Now, what will happen if we decide to send these messages at the same time, without waiting for the confirmation?
Testing many requests in parallel
So let’s try to send the requests in parallel and see if the application can successfully process them. The result is the following:
Data: ,1,0,2,3,6,7,9,4,8,11,10,12,13,15,14,16,21,17,19,18,23,20,22,25,24,29,27,26,35,39,36,38,41,42,40,43,44,47,45,46,51,49,48,5,57,59,58,60,61,62,65,63,64,67,66,68,69,70,71,73,50,75,72,76,37,34,79,80,77,78,74,84,85,83,82,81,86,87,31,32,91,89,88,90,95,94,30,28,33,52,53,56,54,55,93,92,96,98,99,97
Even though the messages were not recorded in the order the test application sent them, all 100 messages were successfully persisted.
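For reference, the parallel variant only changes how the requests are issued: instead of awaiting each one, it fires all of them and then waits for the whole batch (again using the hypothetical endpoint names from the sketch above):

// Parallel variant: issue all 100 requests at once and wait for the batch.
async function runParallel() {
  const requests = [];
  for (let i = 0; i < 100; i++) {
    requests.push(
      fetch(`${GUESTBOOK_URL}/message`, {   // GUESTBOOK_URL as in the previous sketch
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message: String(i) }),
      })
    );
  }
  await Promise.all(requests);
}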
So it seems the application is “ready for production,” and nothing else needs to be done for its resilience.
Right? Wrong.
When I tested with 200 messages, however, the number of messages persisted was consistently short of 200.
Refactor the application
The frontend application is written in JavaScript (running on the client/browser side) and PHP. So far, the component has been simple, but to support the enhancements we need to make, I decided to break it into two microservices: the JavaScript code and a separate backend component.
Here is the new architecture:
Even though PHP is tremendously popular as the language to write web pages, I prefer to use Node.js to write REST-based components.
You can see the implementation at https://github.com/patrocinio/guestbook/tree/backend.
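As a rough sketch, the backend exposes the message operations as REST endpoints. The snippet below uses Express and the node-redis v4 client; the route names and the Redis service address are assumptions, not necessarily what the branch uses:

// Minimal sketch of a Node.js guestbook backend (Express + node-redis v4).
const express = require('express');
const { createClient } = require('redis');

const MESSAGES_KEY = 'messages';
const client = createClient({ url: 'redis://redis-master:6379' }); // assumed service name

const app = express();
app.use(express.json());

// Return the comma-separated message list stored in Redis.
app.get('/messages', async (req, res) => {
  res.json({ data: (await client.get(MESSAGES_KEY)) || '' });
});

// Append a message: read the current list, add the new entry, write it back.
// This read-modify-write sequence is the code path examined in the next section.
app.post('/messages', async (req, res) => {
  const messages = await client.get(MESSAGES_KEY);
  const updated = messages ? `${messages},${req.body.message}` : req.body.message;
  await client.set(MESSAGES_KEY, updated);
  res.json({ data: updated });
});

client.connect().then(() => app.listen(3000));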
Examining the application
Going back to the scenario with 200 messages: if we examine the application more closely, we see the following snippet of code:
messages = await retrieveMessages();
// [...]
const result = await setAsync(MESSAGES_KEY, messages);
The problem is that between retrieving the messages and writing the updated list back, another request may append its own message; our subsequent write overwrites it, and that message is lost.
Here is a sample output:
Add message result: "31"
Add message result: "31,21"
Add message result: "31"
Add message result: "5"
Add message result: "31,21,8"
Add message result: "31,21,26"
Add message result: "31,13"
Add message result: "31,21,26"
Add message result: "31,21,2"
Add message result: "31,24"
Add message result: "31,21,8"
Add message result: "31,21,8"
Add message result: "31,21,2,1"
Add message result: "31,21,26,65,7"
Add message result: "31,21,26"
To ensure that nobody else updates the messages between our read and our write, we need to implement a locking mechanism.
Implementing a locking mechanism
There are a few ways to resolve the issue described above.
In this article, I will discuss a simple one: wrapping the code above in a lock, so that only one backend request fetches and updates the messages at a time.
This solution will certainly add some delay to the response, as only one request can update the message list in Redis at a time.
The solution is available at https://github.com/patrocinio/guestbook/tree/redis_lock_v2.
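I will not reproduce the branch here, but the core idea can be sketched with a simple Redis-based lock: acquire a key with SET NX (and an expiry, so a crashed holder cannot block forever), do the read-modify-write, then release the key. The names below, and the node-redis v4 client, are assumptions for illustration:

// Sketch of a simple Redis lock wrapper (node-redis v4).
const { randomUUID } = require('crypto');

const LOCK_KEY = 'guestbook:lock';

async function withLock(client, fn) {
  const token = randomUUID();
  // Spin until SET NX succeeds; PX lets the lock expire if the holder crashes.
  while (await client.set(LOCK_KEY, token, { NX: true, PX: 5000 }) !== 'OK') {
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  try {
    return await fn();
  } finally {
    // Release only if we still own the lock (it may have expired meanwhile).
    if (await client.get(LOCK_KEY) === token) {
      await client.del(LOCK_KEY);
    }
  }
}

// The message handler then wraps its read-modify-write in the lock:
// await withLock(client, async () => { ...retrieve, append, set... });

Note that the check-and-delete at release time is not atomic; a production-grade lock would use a small Lua script or a library such as Redlock, but the sketch is enough to show the idea.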
In my next article, I will discuss a different solution that decouples the frontend and backend by using a queue.
Conclusion
In this article, I described the evolution of a simple Kubernetes application to achieve resilience.
First, I had to fix a bug to allow multiple users to add messages at the same time. Then, I needed to wrap some code in a lock so that simultaneous requests would not step on each other.
Bring your plan to the IBM Garage.
Are you ready to learn more about working with the IBM Garage? We’re here to help. Contact us today to schedule time to speak with a Garage expert about your next big idea. Learn about our IBM Garage Method, the design, development and startup communities we work in, and the deep expertise and capabilities we bring to the table.
Schedule a no-charge visit with the IBM Garage.