Application Resilience: 4 Ways to Build It

4 min read · Nov 22, 2022

Do you ever wonder whether you could keep your application from ever failing or going down?

The answer is “No.” You can’t prevent every failure or outage that could ever happen to an application.

Don’t worry, though: there are a few areas you can improve to develop a robust system and mindset.

Systems that recover quickly after a failure are considered resilient. Quick recovery does not always mean that whatever went wrong gets fixed within a short period of time. Where something can be recovered immediately, it is. Where it cannot, there needs to be a simple procedure for fixing the situation.

How to build a resilient system

There are multiple ways to build a resilient system, but in this article we will discuss only 4.

1. Auto healing

If your program or one of its components goes down, there should be some means of detecting the downtime. After a set amount of time, the program should be able to continue the task from where it left off.

Let’s talk about the ways to introduce auto-healing:

Retry: If at all possible, have manual or automatic ways to retry things that, for whatever reason, did not succeed the first time. For instance, if you are running an eCommerce site and a customer wants to pay using PayPal but PayPal is down, you can allow a few retries.
Proper UX: In some circumstances, you can also design the user experience (UX) of the application so that the end user is aware of any unfinished business and can try again with little effort. For instance, whenever you post a video to Facebook, it lets you know after a short while whether the upload succeeded or failed.
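
As a rough illustration of the retry idea, here is a minimal Python sketch. The names `TransientError` and `retry`, the default of 3 attempts, and the doubling delay are all assumptions made for this example; they are not part of any particular payment provider's API.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay seconds after the first failure and doubles the
    wait after each further failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))
```

You would wrap the call that talks to the payment provider in a function and pass it to `retry`. If every attempt fails, the error reaches the caller, which is a good place to tell the user what happened.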

2. Monitoring

Monitoring lets you track how your application behaves over time. By "behaves," I mean everything a business needs in order to function properly, including major errors, resource consumption, feature usage, revenue numbers, and cost numbers.

Logs: If handled properly, logs can be a very important source of information. They serve as proof that a certain event occurred. For instance, you can record every time a customer tries to pay and every time a payment fails. It can be used to determine
Metrics: Metrics are a way of quantifying things. Anything that matters to you can be measured: for instance, the total number of 4xx status codes, the total number of registered users, or the total number of users who were unable to pay using PayPal.
Alerting: Even with solid metrics and logs, you don't want to keep checking by hand whether everything is working as expected. Alerting notifies you when something deviates from an expectation you've set. For instance, you can set up an alert for whenever someone is unable to pay using PayPal, so that you are notified through email, Slack, or SMS that the payment has failed for a specific user.
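
To show how logs, metrics, and alerting fit together, here is a small Python sketch using only the standard library. The function names, the metric name format, and the threshold are assumptions for illustration. In a real setup the counter would live in a metrics system and `notify` would send an email, Slack message, or SMS.

```python
import logging
from collections import Counter

logger = logging.getLogger("payments")
metrics = Counter()  # in-memory stand-in for a real metrics backend


def record_payment_failure(user_id, provider):
    """Log the event and increment the matching metric."""
    logger.warning("payment failed user=%s provider=%s", user_id, provider)
    metrics[f"payment_failed.{provider}"] += 1


def check_alert(metric_name, threshold, notify):
    """Call notify(message) when the metric has reached the threshold."""
    value = metrics[metric_name]
    if value >= threshold:
        notify(f"{metric_name} reached {value} (threshold {threshold})")
        return True
    return False
```

The log line answers "what happened to this user?", the counter answers "how often does it happen?", and the alert saves you from having to look at either until something is actually wrong.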

3. Testing

Testing is a way of finding out what would happen if you passed specific data to a specific function, clicked something before or after the page loaded, or a third-party service went down. It means confirming your assumptions about specific features, techniques, scenarios, and so forth.

Your application can be used or abused in a variety of ways, so by validating your assumptions you are already protecting yourself. It also helps you make wiser decisions about what to do next. Writing unit, integration, and, if applicable, end-to-end tests are a few examples.
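
As one possible example, the unit test below checks the assumption "checkout should not crash when the payment provider is down." Everything here (`checkout`, `FakeGateway`, `GatewayDown`, and the status strings) is hypothetical and written only for this sketch, using Python's built-in `unittest` module.

```python
import unittest


class GatewayDown(Exception):
    """Hypothetical error raised when the payment provider is unreachable."""


def checkout(gateway, amount):
    """Return a status string instead of crashing when the gateway is down."""
    try:
        gateway.charge(amount)
    except GatewayDown:
        return "payment_pending"
    return "paid"


class FakeGateway:
    """Test double that simulates a healthy or an unavailable provider."""

    def __init__(self, available):
        self.available = available

    def charge(self, amount):
        if not self.available:
            raise GatewayDown("provider unreachable")


class CheckoutTest(unittest.TestCase):
    def test_paid_when_gateway_is_up(self):
        self.assertEqual(checkout(FakeGateway(available=True), 10), "paid")

    def test_pending_when_gateway_is_down(self):
        self.assertEqual(checkout(FakeGateway(available=False), 10), "payment_pending")
```

Saved in a file, these tests can be run with `python -m unittest`. The fake gateway lets you simulate the third-party outage without waiting for a real one.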

4. Incident Retros

As we’ve already said, failures and other negative outcomes are unavoidable. Whenever something goes wrong, there should always be a discussion or report about what happened, when, and how.

In incident retros, make sure not to blame anyone: things go wrong not because of a person but because the right process was not in place to begin with. After getting insight into how things happened, try to agree on a process for handling this kind of situation next time. For example, “User A was not able to pay because the payment service was down at that time” could be a talking point. The resulting process might be: set up an automatic retry on the first failure; if the payment still fails, notify the user that their action has failed; and give users a simple way to retry the payment themselves.
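
The process that comes out of such a retro could look something like the Python sketch below: one automatic retry, then a clear message and a flag the UI can use to offer a manual retry. The names `PaymentServiceDown`, `PaymentOutcome`, and `pay_with_one_retry`, as well as the message texts, are assumptions made for this example.

```python
from dataclasses import dataclass


class PaymentServiceDown(Exception):
    """Hypothetical error raised when the payment service is unavailable."""


@dataclass
class PaymentOutcome:
    success: bool
    message: str     # text shown to the user
    can_retry: bool  # lets the UI show a "Try again" button


def pay_with_one_retry(charge):
    """Try the charge, retry once on failure, then report back to the user."""
    for _ in range(2):  # first attempt plus one automatic retry
        try:
            charge()
            return PaymentOutcome(True, "Payment complete.", can_retry=False)
        except PaymentServiceDown:
            continue
    return PaymentOutcome(False, "Payment failed. Please try again.", can_retry=True)
```

The point is less the code than the agreement behind it: the team decides up front what happens on the first failure, on the second, and what the user sees.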

Conclusion

Always remember that failure is inevitable. In light of this, keep three questions in mind: if something or some part broke down, what would happen, how would it affect us, and how could we handle it? With this attitude, you’ll start building resilience into your work. Building a robust system takes time and effort, and the key is to start small and keep going.

Written by ujwal dhakal
