In this tutorial, we discuss availability, one of the most important system design concepts. Many of us have been unable to access an application because of an outage. Recently, YouTube faced a global outage that stopped users from streaming videos for about an hour. You may wonder what causes such outages and how to prevent them. Let’s find out.

What is Availability?

Availability is a measure of how accessible and reliable a system is to its users. An available system stays up and returns a response to every client request. In other words, whenever a user wants to use the service, the system must be reachable and able to satisfy the request.

Availability is the percentage of time in a given period that a system is able to perform its function under normal conditions. One way to look at it is how resistant a system is to failures. The level of availability a system requires depends on how it is used. Let us take some examples.

Air traffic control systems are among the best examples of systems that require high availability. Air traffic today is dense and complex, and even a brief outage of such a system can lead to disaster. On the other hand, a system with few visitors, where occasional downtime does little harm, can tolerate lower availability. Availability is expensive, so you need to optimize it for your needs.

As another example, streaming services such as Netflix and Amazon Prime must keep their content available to users at all times. They achieve high availability through redundant infrastructure, load balancing, and distributed data centers.

How is Availability Measured?

The availability of a system is measured as the percentage of time the system is up in a given period. It is calculated by dividing the total uptime by the sum of the uptime and downtime in that period.

Availability = Uptime ÷ (Uptime + Downtime)
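
The formula translates directly into code. The sketch below is only an illustration: the function name and the 30-day example period are assumptions, not part of any standard library or of the original tutorial.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the measurement period must be longer than zero")
    return uptime_hours / total


# Hypothetical 30-day month (720 hours) with one hour of downtime.
monthly = availability(uptime_hours=719, downtime_hours=1)
print(f"{monthly:.4%}")  # prints 99.8611%
```

Even a single hour of downtime in a month keeps this hypothetical system just below three nines.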

The Nines of Availability

The nines of availability is a commonly used shorthand for system availability, where the number of nines counts the leading nines in the availability percentage. For example, three nines of availability corresponds to 99.9% uptime. This measurement allows organizations to evaluate and optimize availability based on their specific requirements.

In high-demand applications, we usually describe availability in terms of nines rather than raw percentages. If availability is 99 percent, the system is said to have “2 nines” of availability; if it is 99.9 percent, it has “3 nines,” and so on. A system with 5 nines (i.e., 99.999%) of availability is often regarded as the gold standard of availability. Each additional nine cuts the permitted downtime by a factor of ten: 3 nines allows about 8.76 hours of downtime per year, while 5 nines allows only about 5.26 minutes.
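
The downtime budget behind each level of nines can be tabulated with a short script. This is a minimal sketch that assumes a 365-day year; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability_for_nines(nines: int) -> float:
    """Return the availability fraction for a given number of nines."""
    return 1 - 10 ** -nines


def downtime_hours_per_year(nines: int) -> float:
    """Return the downtime allowed per year, in hours."""
    return HOURS_PER_YEAR * 10 ** -nines


for n in range(1, 6):
    pct = availability_for_nines(n) * 100
    hours = downtime_hours_per_year(n)
    print(f"{n} nines  {pct:g}%  {hours:.4f} hours/year")
```

Going from 2 nines to 5 nines shrinks the yearly downtime budget from 87.6 hours to roughly 5 minutes, which is why each extra nine tends to cost much more than the previous one.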

How do we achieve High Availability?

High availability is the ability of a system to keep operating despite the failure of components. To increase availability, we can add redundancy by duplicating hardware components such as servers or storage. For example, a system with two identical web servers behind a load balancer can continue operating even if one of the servers goes down, because the load balancer redirects traffic to the remaining server. By adding redundancy, we make the system more resilient to failure.
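
The effect of redundancy can be estimated with a little arithmetic. The sketch below assumes that the redundant copies fail independently and that any single healthy copy is enough to serve traffic. Both are simplifying assumptions that the tutorial does not state, and real failures are often correlated, so treat the result as an upper bound.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group that works as long as one copy is up.

    Assumes independent failures: the group is down only when every
    copy is down at the same time.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1 - (1 - component_availability) ** copies


# Two hypothetical web servers, each available 99% of the time.
print(f"{redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, two 99% servers behind a load balancer reach about 99.99%, provided the load balancer itself is not a single point of failure.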

Passive Redundancy

Here only some of the components (servers or storage devices) are active at any given time, and backup components stand by in case of a failure. If an active component fails, a backup component takes over and becomes active. This allows the system to continue operating and maintain availability.
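
A minimal sketch of passive (active-passive) redundancy might look like the following. The `Server` and `ActivePassivePair` classes are hypothetical stand-ins for real infrastructure, and failure is simulated with a simple flag.

```python
class Server:
    """A toy server whose health can be toggled to simulate a failure."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActivePassivePair:
    """Send all traffic to one server and keep the other on standby."""

    def __init__(self, primary: Server, standby: Server) -> None:
        self.active = primary
        self.standby = standby

    def handle(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Failover: promote the standby and retry the request once.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


pair = ActivePassivePair(Server("primary"), Server("standby"))
print(pair.handle("request-1"))
pair.active.healthy = False  # simulate a crash of the active server
print(pair.handle("request-2"))
```

The standby does no work until the failover happens, which keeps the design simple but leaves that capacity idle in normal operation.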

Active Redundancy

Here multiple active components (servers or storage devices) work simultaneously to perform the task. In the event of a failure of one of the active components, the other active components can take over and maintain the availability of the system.
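
Active redundancy can be sketched as a pool in which every node serves traffic and unhealthy nodes are simply skipped. The `Node` and `ActiveActivePool` names and the round-robin policy are assumptions chosen for illustration; real load balancers offer many other policies.

```python
class Node:
    """A toy node whose health can be toggled to simulate a failure."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


class ActiveActivePool:
    """Spread requests over all nodes in round-robin order."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = list(nodes)
        self._next = 0

    def handle(self, request: str) -> str:
        # Try each node at most once, starting where the last call stopped.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node.healthy:
                return f"{node.name} handled {request}"
        raise RuntimeError("no healthy nodes available")


pool = ActiveActivePool([Node("a"), Node("b"), Node("c")])
pool.nodes[1].healthy = False  # simulate a failure of node "b"
for i in range(4):
    print(pool.handle(f"request-{i}"))
```

Unlike the passive setup, no failover step is needed: the remaining nodes absorb the load of the failed one, so they must have enough spare capacity to do so.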

It is very important to note that redundancy alone is not enough to guarantee high availability. Failure detection mechanisms must also be in place to identify failures. This requires regular high-availability testing and the ability to take corrective action whenever one of the components in the system becomes unavailable.
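
One common way to detect failures is a heartbeat: each component reports in periodically, and a monitor flags any component that has been silent for too long. The sketch below is an assumption about how such a monitor could be structured; the class name, the timeout value, and the injectable clock are illustrative.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Flag components whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the monitor can be tested
        self.last_seen: dict[str, float] = {}

    def record_heartbeat(self, component: str) -> None:
        self.last_seen[component] = self.clock()

    def failed_components(self) -> list[str]:
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("web-1")
monitor.record_heartbeat("web-2")
print(monitor.failed_components())  # nothing has timed out yet
```

The timeout is a trade-off: a short value detects failures quickly but may flag a healthy component that is merely slow, while a long value delays the corrective action.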

There are both hardware-based and software-based approaches to achieving high availability. Redundancy is primarily a hardware-based approach, while other techniques such as top-to-bottom or distributed high-availability approaches can involve both hardware and software. Software-based downtime reduction techniques can also be effective.

To achieve high availability, we often implement redundancy or disaster recovery strategies, which can hurt other aspects of system performance, such as latency or throughput. For example, redundancy may involve replicating data or tasks across multiple resources, which can increase the time it takes to complete a task and so raise latency.
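
The latency cost of replication can be illustrated with a rough model. The sketch assumes that replicas are written in parallel, that a synchronous write waits for the slowest replica, and that an asynchronous write is acknowledged right after the local write. The function name and all timings are hypothetical.

```python
def write_latency_ms(local_ms: float, replica_round_trips_ms: list[float],
                     synchronous: bool) -> float:
    """Estimate the latency of one write as seen by the client."""
    if synchronous and replica_round_trips_ms:
        # The client waits until the slowest replica has confirmed.
        return local_ms + max(replica_round_trips_ms)
    # Asynchronous replication happens after the client gets its answer.
    return local_ms


# Hypothetical timings: a nearby replica and one in a distant region.
replicas = [5.0, 40.0]
print(write_latency_ms(2.0, replicas, synchronous=True))
print(write_latency_ms(2.0, replicas, synchronous=False))
```

In this model the asynchronous write is faster, but a failure right after the acknowledgement can lose data that had not yet reached a replica, so the choice trades latency against durability.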

Difference between High Availability and Fault Tolerance

Both high availability and fault tolerance are strategies for achieving high uptime, but they approach the problem differently. High availability is about a system’s or component’s ability to remain operational and accessible with minimal downtime. Fault tolerance, on the other hand, is about a system’s or component’s ability to continue functioning normally even in the event of a failure.

Fault tolerance involves the use of multiple systems that run in parallel. In the event of a failure in the main system, another system can take over without any loss of uptime. This requires advanced hardware that can detect component faults and enable the systems to operate in coordination. However, complex networks and devices may take longer to respond to malfunctions, and a technical issue that crashes one system may also bring down the redundant systems running in parallel, leading to a system-wide failure.

High availability, on the other hand, also uses a software-based approach to minimize server downtime rather than relying solely on hardware redundancy. A high-availability cluster uses a collection of servers working together to achieve maximum redundancy. This can be more flexible and easier to implement than a fault-tolerant system, but it may not provide the same level of protection against system failures.

Availability vs Reliability

A reliable system is generally available, because it rarely fails. However, an available system is not necessarily reliable. In other words, high reliability contributes to high availability, but it is possible to achieve high availability even with an unreliable system, for example one that fails often but recovers within seconds.
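
The difference can be made concrete with the widely used relation Availability = MTBF ÷ (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. These terms do not appear earlier in this tutorial, and the two systems below are hypothetical.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Reliable but slow to repair: fails about once a year, down for a full day.
reliable = availability_from_mtbf(mtbf_hours=8760, mttr_hours=24)

# Unreliable but quick to recover: fails every 10 hours, back in one second.
unreliable = availability_from_mtbf(mtbf_hours=10, mttr_hours=1 / 3600)

print(f"reliable system:   {reliable:.4%}")
print(f"unreliable system: {unreliable:.4%}")
```

With these numbers the frequently failing system comes out more available than the rarely failing one, because availability depends on how quickly a system recovers as well as on how often it fails.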

That’s all about availability in system design. If you have any queries or feedback, please write to us at contact@waytoeasylearn.com. Enjoy learning, and enjoy system design!
