High Availability and Disaster Recovery

Designing Highly Available Systems
Failover Mechanisms
Disaster Recovery Planning (DRP)
Geo-Redundancy

Designing Highly Available Systems

High Availability (HA) means a system stays operational with minimal downtime, even when some servers or components fail.

HA's main goals are:

Minimal Downtime (availability targets such as 99.9% or 99.99%)
Fault Tolerance (the system keeps working when failures happen)
Continuous Service (no noticeable interruption for users)

Key Principles of Designing Highly Available Systems

1. Redundancy

Running a system on a single server carries a high risk of failure
An HA design runs multiple servers/components
Examples:
- Multiple web server replicas behind a load balancer
- Database replication (leader-follower, also called master-slave)

2. Load Balancing

Distribute user requests across multiple servers
If one server fails, the load balancer sends traffic to the remaining healthy servers
Tools: NGINX, HAProxy, AWS ELB, GCP Load Balancer
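
The idea of routing around a failed server can be illustrated with a minimal sketch. This is not how NGINX or HAProxy work internally; the class name `RoundRobinBalancer`, its methods, and the server names are hypothetical, and health status is set by hand rather than by real health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
lb.mark_down("web-2")  # simulate a failed health check
picks = [lb.next_server() for _ in range(4)]
print(picks)  # web-2 never appears while it is marked down
```

In a real deployment the `mark_down` / `mark_up` calls would be driven by periodic health checks rather than called manually.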

3. Failover Mechanisms

If one component fails, the system automatically switches to a backup resource
Examples:
- Primary DB fails → traffic is redirected to a replica DB
- Active-passive clustering

4. Replication

Maintain copies of data across multiple servers/regions
Types:
- Synchronous replication → strong consistency, but higher write latency (each write waits for the replicas to confirm)
- Asynchronous replication → fast writes, but replicas lag slightly (eventual consistency)
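
The difference between the two modes can be sketched with in-memory objects. This is a toy model under stated assumptions: `Primary`, `Replica`, and `flush` are hypothetical names, there is no network, and "acknowledging" a write simply means returning from `write`.

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped (async mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Return only after every replica has the write.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Return immediately; replicas catch up on the next flush.
            self.pending.append((key, value))

    def flush(self):
        for key, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, value)
        self.pending.clear()


sync_replica = Replica()
Primary([sync_replica], synchronous=True).write("balance", 100)

async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_primary.write("balance", 100)
lag_before_flush = "balance" not in async_replica.data  # replica is behind
async_primary.flush()
```

If the asynchronous primary crashed before `flush`, the pending write would be lost on the replica, which is the replication lag the list above refers to.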

5. Monitoring & Auto-healing

The system should be monitored continuously
If a node fails, recovery should be automated
Tools: Prometheus, Grafana, Kubernetes health checks, CloudWatch
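
A single monitoring pass with auto-healing might look like the sketch below. The `Node` class and `heal` function are hypothetical stand-ins; real systems such as Kubernetes probe over the network and restart containers rather than flipping a flag.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.running = True

    def is_healthy(self):
        return self.running

    def restart(self):
        self.running = True


def heal(nodes):
    """Run one monitoring pass: restart every unhealthy node, return their names."""
    restarted = []
    for node in nodes:
        if not node.is_healthy():
            node.restart()
            restarted.append(node.name)
    return restarted


nodes = [Node("api-1"), Node("api-2")]
nodes[1].running = False  # simulate a crash
restarted = heal(nodes)
print(restarted)
```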

6. Avoiding Single Point of Failure (SPOF)

A single point of failure (SPOF) is any component whose failure brings down the entire system
In an HA design, every critical component should be redundant
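
The effect of removing a SPOF can be estimated with simple probability. The sketch below assumes component failures are independent, which real systems only approximate, and the 99.9% and 99% figures are illustrative values rather than measurements.

```python
from math import prod


def series(*availabilities):
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def parallel(availability, copies):
    """Any one of `copies` redundant components is enough to stay up."""
    return 1 - (1 - availability) ** copies


# Web tier at 99.9% in front of a single database at 99%.
single_db = series(0.999, 0.99)

# Same web tier, but the database now has one redundant copy.
redundant_db = series(0.999, parallel(0.99, 2))

print(f"single database:    {single_db:.5f}")
print(f"redundant database: {redundant_db:.5f}")
```

With a single database the whole system can be no more available than that database; adding one redundant copy moves the bottleneck to the web tier.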

7. Geographic Distribution

To survive regional failures (such as a data center outage), the system is deployed in multiple regions
Example: Netflix's architecture → multiple AWS regions with failover

Availability Targets (SLA)

Availability	Downtime per Year
99% (Two nines)	~3.65 days
99.9% (Three nines)	~8.76 hours
99.99% (Four nines)	~52.6 minutes
99.999% (Five nines)	~5.26 minutes

Each additional nine makes the system more reliable, but also makes the infrastructure more expensive to build and operate.
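
The downtime figures in the table follow directly from the availability percentage. A short sketch of the arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime in minutes per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```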

Real-Life Examples

Banking Systems → HA is essential because downtime = lost money
E-commerce (Amazon, Flipkart) → downtime during sales = huge losses
Cloud Services (AWS, GCP, Azure) → multi-region replication + failover

Summary

Designing highly available systems means building systems that tolerate failures and keep service uninterrupted. This is achieved through redundancy, replication, failover, load balancing, monitoring, and multi-region deployment.

Failover Mechanisms

A failover mechanism is an automated process that shifts workloads or services to a backup system, server, network, or component when a failure occurs, so that downtime is avoided or minimized.

👉 In other words: if a server or service fails, a healthy backup system automatically takes its place.

Failover Components

Primary System – Normally handles the workload
Secondary (Backup) System – Stays in standby mode
Monitoring System (Heartbeat/Health Checks) – Continuously monitors the primary system
Failover Trigger – When the monitoring system detects a failure, it activates the secondary
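
The four components above can be tied together in a small sketch. `FailoverMonitor`, the `max_missed` threshold of three consecutive missed heartbeats, and the system names are assumptions made for illustration; real tools use their own detection rules and timeouts.

```python
class FailoverMonitor:
    def __init__(self, primary, secondary, max_missed=3):
        self.active = primary
        self.standby = secondary
        self.max_missed = max_missed
        self.missed = 0

    def record_heartbeat(self, ok):
        """Feed one health-check result for the active system; return who is active."""
        if ok:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed and self.standby is not None:
                # Failover trigger: promote the standby to active.
                self.active, self.standby = self.standby, None
                self.missed = 0
        return self.active


monitor = FailoverMonitor("db-primary", "db-standby")
heartbeats = (True, False, False, False, True)
history = [monitor.record_heartbeat(ok) for ok in heartbeats]
print(history)  # the standby takes over after the third consecutive miss
```

Requiring several consecutive misses before switching is a common way to avoid failing over on a single dropped health check.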

Types of Failover Mechanisms

Cold Failover (Manual Restart)
- Backup system is completely idle
- Manual intervention is needed to bring it up when a failure happens
- Example: Manual DB recovery from backup
Warm Failover
- Backup system is running but is not synced in real time
- Switches over with some delay when a failure happens
- Example: Standby database replication (with lag)
Hot Failover (Active-Passive)
- Backup system is on standby and syncs continuously
- Failover is automatic and fast, with only a brief interruption
- Example: AWS RDS Multi-AZ Deployment
Active-Active Failover
- Multiple systems are active simultaneously
- If one fails, the others automatically absorb its workload with little or no downtime
- Example: Clustered application servers behind a load balancer

Real-Life Examples of Failover Mechanisms

Databases:

PostgreSQL streaming replication + Patroni failover.
MySQL Group Replication.
MongoDB Replica Sets.

Cloud Services:

AWS Elastic Load Balancer (ELB) failover.
Google Cloud SQL failover replicas.
Azure Availability Zones.

Networking:

DNS Failover (Route53 Health Checks).
VRRP (Virtual Router Redundancy Protocol) for routers.

Analogy (Easy to Understand)

Think of the spare wheel in your car:

If a tire gets punctured → you swap in the spare wheel (failover).
If the spare wheel is always inflated and ready → hot failover.
If the spare wheel is itself flat and has to be repaired first → cold failover.
If the car always has two tires working at the same time, and it keeps running when one fails → active-active failover. 🚗

Summary

Failover = switching to a backup system on failure, usually automatically.
Types: Cold, Warm, Hot, Active-Active.
Key for High Availability & Reliability.
Real-world usage: databases, cloud infra, DNS, networking.

Disaster Recovery Planning (DRP)

Disaster Recovery Planning (DRP) is a structured strategy in which an organization creates a plan to restore its IT infrastructure, data, and services after a major failure or disaster.

A disaster can be:

🌪️ Natural: Flood, Earthquake, Fire
⚡ Technical: Data Center Crash, Hardware Failure, Cyberattack, Ransomware
👨‍💻 Human Error: Accidental deletion, Misconfiguration

DRP goals are:
➡️ Minimize downtime
➡️ Minimize data loss
➡️ Resume business quickly

Core Concepts of DRP

RTO (Recovery Time Objective)
- The maximum acceptable time to restore the system after a disaster.
- Example: Banking app RTO = 30 minutes.
RPO (Recovery Point Objective)
- The maximum acceptable data loss, measured as the time window of recent data that may be lost.
- Example: Stock trading app RPO = 1 second (almost no data loss allowed).
Backup Strategy
- Regular backups (daily, hourly, real-time).
- Backup locations: On-site, Off-site, Cloud.
Replication
- Data replication across multiple regions (synchronous/asynchronous).
- Example: AWS Aurora Global Database.
Failover & Failback
- Failover → When the primary system goes down, switch to the backup system.
- Failback → When the primary recovers, switch back to it.
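
RTO and RPO become concrete once they are checked against an actual incident timeline. The sketch below uses hypothetical timestamps and function names; it treats the data-loss window as the time between the last backup and the failure, and the downtime as the time between the failure and the restore.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, failure_time):
    """Data written after the last backup is what a restore would lose."""
    return failure_time - last_backup


def meets_objectives(last_backup, failure_time, restored_at, rpo, rto):
    loss = data_loss_window(last_backup, failure_time)
    downtime = restored_at - failure_time
    return {"rpo_met": loss <= rpo, "rto_met": downtime <= rto}


failure = datetime(2024, 1, 1, 12, 0)  # hypothetical incident time

# Hourly backups, RPO of 1 hour, RTO of 30 minutes.
result = meets_objectives(
    last_backup=failure - timedelta(minutes=45),
    failure_time=failure,
    restored_at=failure + timedelta(minutes=20),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=30),
)
print(result)
```

With daily backups instead, the same incident could lose up to 24 hours of data, so the backup frequency has to be chosen to fit the RPO.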

DRP Approaches

Cold Site
- A backup location exists, but its infrastructure is not ready.
- Cheap, but recovery takes longer.
Warm Site
- Partial infrastructure is ready (backup servers, limited data).
- Moderate cost, faster recovery.
Hot Site
- Fully replicated, running infrastructure at another location.
- Expensive, but near-instant failover is possible.

Real-life Example

Imagine an AWS region outage.

If all AWS services in that region go down → a company with a multi-region DRP, such as Netflix, would aim for:
- RPO ≈ 0 (little or no data loss, thanks to global replication).
- RTO = very short, from a few seconds to minutes (traffic automatically shifts to another AWS region).

Summary

Disaster Recovery Planning (DRP) is a roadmap that includes:

RTO & RPO definitions
Backup & Replication setup
Failover mechanisms design
Regular testing & recovery drills

This ensures that critical business services can survive and recover even after disasters.

Geo-Redundancy

Geo-Redundancy means deploying your system or application across multiple geographical locations (data centers, cloud regions, availability zones), so that if one location fails (power outage, natural disaster, network issues), another location can seamlessly continue working.

This is a core technique for High Availability (HA) and Disaster Recovery (DR).

Example

If a banking system runs only in a Mumbai data center and that data center floods → the system goes down.
If the same system is replicated across Mumbai + Delhi + Singapore data centers → when Mumbai goes down, users continue to be served from Delhi/Singapore.

Components of Geo-Redundancy

Data Replication
- Synchronous replication → Real-time copy in multiple regions (needs low latency between regions, expensive).
- Asynchronous replication → Data is copied with a slight delay (cost-effective, slightly higher RPO).
Load Balancing & Traffic Routing
- DNS-based and network-level routing (GeoDNS, Anycast).
- Cloud traffic-routing services (AWS Route 53, GCP Cloud DNS, Azure Traffic Manager).
Failover Mechanism
- If one region fails, traffic automatically shifts to another region.
Consistency Models
- Strong consistency vs eventual consistency (based on CAP theorem trade-offs).

Real-Life Use Cases

Netflix → replicates its video content across a worldwide CDN (Content Delivery Network).
Banks → replicate their transaction systems across multiple regions.
E-commerce (Amazon, Flipkart) → if one data center goes down, cart, payment, and order services continue from another.

🔹 Diagram (High-Level Idea)

User (India) → Nearest Region (Mumbai)
                ↓ Failover
User (India) → Backup Region (Singapore)
                ↓ Failover
User (India) → Secondary Backup (US-East)
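
The failover chain in the diagram amounts to picking the first healthy region from a priority list. A minimal sketch, assuming region health is already known (in practice it would come from health checks); `REGION_PRIORITY` and `route` are hypothetical names:

```python
# Priority order for a user in India, following the diagram above.
REGION_PRIORITY = ["mumbai", "singapore", "us-east"]


def route(region_health, priority=REGION_PRIORITY):
    """Return the first healthy region in priority order, or None if all are down."""
    for region in priority:
        if region_health.get(region, False):
            return region
    return None


all_up = {"mumbai": True, "singapore": True, "us-east": True}
mumbai_down = {"mumbai": False, "singapore": True, "us-east": True}

print(route(all_up))       # nearest region serves the user
print(route(mumbai_down))  # traffic shifts to the backup region
```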

Summary

Geo-Redundancy = deploying a system across multiple geographical regions + data replication + smart traffic routing → to ensure high availability, low downtime, and disaster recovery.
