Have you ever wondered whether you could keep your application from ever failing or going down?
The answer is “No.” You can’t prevent every failure or outage that could ever affect an application.
Don’t worry, though: you can improve a few areas to build a resilient system and mindset.
Systems that recover quickly after a failure are considered resilient. Quick recovery does not always mean that whatever went wrong gets fixed within a short period of time. If something can be recovered immediately, it should be. If not, there should be a simple procedure for fixing the situation.
How to build a resilient system
There are many ways to build a resilient system, but here we will discuss only four.
1. Auto healing
If your program or one of its components goes down, there should be some means of detecting the downtime. After a set amount of time, the program should be able to continue the task from where it left off.
Let’s talk about the ways to introduce auto-healing:
- Retry: Wherever possible, have manual or automatic ways to retry things that, for whatever reason, did not succeed the first time. For instance, if you run an e-commerce site and a customer wants to pay with PayPal while PayPal is down, you can allow a few retries.
- Proper UX: In some circumstances, you can also design the user experience (UX) of the application so that the end user is aware of any unfinished business and can try again with little effort. For instance, whenever you post a video to Facebook, it lets you know after a short while whether the upload succeeded or failed.
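The retry idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `PaymentProviderDown`, `retry`, and `pay_with_provider` are hypothetical names invented for this example, and the doubling delay between attempts (exponential backoff) is one common choice rather than something the approach requires.

```python
import time


class PaymentProviderDown(Exception):
    """Hypothetical error raised when the payment provider is unreachable."""


def retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except PaymentProviderDown:
            if attempt == attempts:
                raise  # out of retries: let the caller inform the user
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a fake provider call that fails twice and then succeeds.
calls = {"count": 0}


def pay_with_provider():
    calls["count"] += 1
    if calls["count"] < 3:
        raise PaymentProviderDown("provider unavailable")
    return "paid"


# The demo passes a no-op sleep so it does not actually wait.
result = retry(pay_with_provider, sleep=lambda seconds: None)
```

When the last attempt also fails, the error is re-raised on purpose: that is the point where the UX from the second bullet takes over and tells the user the action is unfinished.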
2. Monitoring
Monitoring lets you track the behavior of your application over time. By “behavior,” I mean everything a business needs in order to function properly, including major errors, resource consumption, feature usage, revenue numbers, and cost numbers.
- Logs: Handled properly, logs can be a very important source of information. They serve as proof that a certain event occurred. For instance, you can record every time a customer tries to pay and every time a payment fails. It can be used to determine
- Metrics: Metrics are a way of quantifying things. Anything that matters to you can be measured: for instance, the total number of 4xx status codes, the total number of registered users, or the total number of users who were unable to pay with PayPal.
- Alerting: Even with solid metrics and logs, you don’t want to keep checking by hand whether everything is working as expected. Alerting notifies you when something deviates from the expectation you’ve set. For instance, you can set up an alert for whenever someone is unable to pay with PayPal, so that you are notified through email, Slack, or SMS that the payment failed for a specific user.
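To show how the three pieces fit together, here is a small sketch using only Python's standard library. Everything in it is an assumption made for illustration: the in-memory `Counter` stands in for a real metrics backend, `notify` stands in for an email, Slack, or SMS integration, and the threshold of 3 failures is an arbitrary expectation.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")

metrics = Counter()   # in-memory stand-in for a real metrics backend
ALERT_THRESHOLD = 3   # assumed expectation: alert at 3 failed payments
sent_alerts = []      # stand-in for email, Slack, or SMS delivery


def notify(message):
    """Hypothetical notifier; a real one would call an email, Slack, or SMS service."""
    sent_alerts.append(message)


def record_payment_failure(user_id, provider):
    # Log: evidence that the event happened, with enough context to investigate.
    logger.warning("payment failed user=%s provider=%s", user_id, provider)
    # Metric: a number that can be tracked over time.
    key = f"payment_failed.{provider}"
    metrics[key] += 1
    # Alert: notify once, at the moment the metric reaches the threshold.
    if metrics[key] == ALERT_THRESHOLD:
        notify(f"{ALERT_THRESHOLD} failed {provider} payments recorded")


for user in ("user-a", "user-b", "user-c", "user-d"):
    record_payment_failure(user, "paypal")
```

Alerting on a threshold rather than on every single failure is a design choice to keep notifications meaningful; alerting per user, as in the PayPal example, would simply call `notify` on every failure.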
3. Testing
Testing is a way of finding out what would happen if, for example, you passed specific data to a specific function, clicked something before or after the page loaded, or a third-party service went down. It is about confirming your assumptions regarding specific features, techniques, scenarios, and so forth.
Your application can be used or abused in a variety of ways, so by validating your assumptions you are already protecting yourself. It also helps you make wiser decisions about what to do next. Writing unit tests, integration tests, and, where applicable, end-to-end tests is one way to do this.
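As an illustration of testing the “third-party service went down” case, the sketch below simulates the outage with `unittest.mock` instead of waiting for a real one. The `checkout` function, the `PaymentServiceDown` error, and the status strings are hypothetical names made up for this example; the test functions are written so they can be collected by pytest or simply called directly.

```python
from unittest import mock


class PaymentServiceDown(Exception):
    """Hypothetical error for an unreachable third-party payment service."""


def checkout(charge, amount):
    """Charge the customer; return a status the UI can show instead of crashing."""
    try:
        charge(amount)
    except PaymentServiceDown:
        return "payment_failed_try_again"
    return "paid"


def test_successful_payment():
    charge = mock.Mock()
    assert checkout(charge, 25) == "paid"
    charge.assert_called_once_with(25)


def test_third_party_outage_is_handled():
    # Simulate the outage: every call to the mock raises the error.
    charge = mock.Mock(side_effect=PaymentServiceDown("service down"))
    assert checkout(charge, 25) == "payment_failed_try_again"
```

The second test encodes an assumption ("an outage must not crash checkout") so that it is checked on every run rather than discovered in production.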
4. Incident Retros
As we’ve already said, failures and other negative outcomes are unavoidable. Whenever something goes wrong, there should always be a discussion or report about what happened, when, and how.
In incident retros, make sure not to blame anyone: things usually go wrong not because of a person but because the right process was not in place to begin with. Once you understand how things happened, try to agree on a process for handling this kind of situation next time. For example, a retro might discuss that User A was not able to pay because the payment service was down at the time. The outcome could be to retry automatically after the first failure, to notify users that their action failed if the payment still does not go through, and to give users a simple way to retry the payment themselves.
Always remember that failure is inevitable. With that in mind, try to ask three questions about any part of your system: What would happen if it broke down? How would that affect us? How could we handle it? With this attitude, you’ll start building resilience into your work. Building a robust system takes time and effort, and the key is to start small and keep going.
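The process that comes out of such a retro could look roughly like the sketch below: one automatic retry, then a clear message to the user together with a flag the UI can use to offer a manual retry. The names `PaymentResult`, `pay`, and `flaky_charge`, as well as the message texts, are assumptions made for this illustration.

```python
from dataclasses import dataclass


class PaymentServiceDown(Exception):
    """Hypothetical error for an unreachable payment service."""


@dataclass
class PaymentResult:
    paid: bool
    message: str
    can_retry: bool  # lets the UI show a "try again" button


def pay(charge, amount, automatic_retries=1):
    """Try the charge, retry automatically, then tell the user what happened."""
    for _ in range(1 + automatic_retries):
        try:
            charge(amount)
            return PaymentResult(True, "Payment received.", can_retry=False)
        except PaymentServiceDown:
            continue  # fall through to the next automatic attempt
    return PaymentResult(False, "Payment failed. Please try again.", can_retry=True)


# Usage: the service is down for the first two calls, then recovers.
attempts = []


def flaky_charge(amount):
    attempts.append(amount)
    if len(attempts) <= 2:
        raise PaymentServiceDown("service down")


first = pay(flaky_charge, 25)   # original attempt and the automatic retry both fail
second = pay(flaky_charge, 25)  # the user chooses to try again and it succeeds
```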