Designing Reliable Systems
Redundancy and Replication
Redundancy
Definition: Duplicate components or backup systems so that if one fails, the system doesn't completely shut down. Purpose: Fault tolerance and high availability.
Types:
- Hardware Redundancy: Multiple servers, disks (RAID setup)
- Network Redundancy: Multiple internet connections, multiple load balancers
- Power Redundancy: Backup generators, dual power supply
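As a rough illustration of why redundancy raises availability, the sketch below computes the chance that at least one of several identical components is up. It assumes the components fail independently, which real servers sharing a rack or power feed often do not, and the function name `combined_availability` and the 99% figure are made up for the example.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` redundant components is up.

    Assumption: failures are independent, so the chance that all copies
    are down at once is (1 - availability) ** copies.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical server that is up 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
# 1 0.990000
# 2 0.999900
# 3 0.999999
```

Under that independence assumption, each extra copy removes most of the remaining downtime, which is why even a single backup helps so much.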
Replication
Definition: Making copies of data and storing them in multiple locations or on multiple servers. Purpose: Performance and data availability. Example: Database replication with one primary DB and multiple read replicas.
Types:
- Synchronous Replication: The primary acknowledges a write only after the replicas have applied it (stronger consistency, but higher write latency)
- Asynchronous Replication: The primary acknowledges a write first and then propagates it to the replicas (faster, but replicas can lag behind and serve stale data)
Goal: High availability and disaster recovery (DR)
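The difference between the two replication modes can be sketched with a toy in-memory model. This is only an illustration: the `Primary` and `Replica` classes and the `flush` method are hypothetical names, and a real database replicates over the network with retries and durability guarantees that are not modelled here.

```python
from collections import deque


class Replica:
    """Toy replica that stores whatever the primary sends it."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Toy primary (hypothetical API) that replicates in one of two modes."""

    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet sent to replicas (async mode)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: every replica applies the write before we acknowledge.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Asynchronous: acknowledge now, replicate later.
            self.pending.append((key, value))
        return "ack"

    def flush(self):
        """Stand-in for the background process that ships queued writes."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)


sync_replica = Replica()
Primary([sync_replica], synchronous=True).write("balance", 500)
print(sync_replica.data.get("balance"))   # 500

async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_primary.write("balance", 500)
print(async_replica.data.get("balance"))  # None (replica is lagging)
async_primary.flush()
print(async_replica.data.get("balance"))  # 500
```

The window between `write` and `flush` is the replication lag: a read served by the replica in that window returns stale data.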
Consistency Models
In system design, a consistency model defines what a read is guaranteed to return after data has been written in a distributed system.
When a system runs on multiple nodes, an update cannot reach every node at the same instant. Different consistency models therefore make different trade-offs in how read and write operations behave while the nodes catch up.
Types of Consistency Models
Strong Consistency:
- Once a write completes, every later read from any node returns the latest data
- Example: When you withdraw $500 from a bank account, the new balance is immediately visible at ATMs and other bank branches
Eventual Consistency:
- Nodes may briefly return stale data, but if no new updates arrive, all of them converge to the latest value
- Example: Social media likes. The like count you see may take a few seconds or minutes to catch up
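One way to picture the like-count example is a counter that each node updates locally and later reconciles with its peers. The sketch below uses a grow-only counter (often called a G-Counter), a technique these notes do not mention and that is swapped in purely for illustration. The `LikeCounter` class and the node names are hypothetical.

```python
class LikeCounter:
    """Grow-only counter: each node counts its own likes, merges take the max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> likes recorded by that node

    def like(self):
        # A node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the per-node maximum makes merging safe to repeat
        # and independent of the order in which nodes sync.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


node_a = LikeCounter("a")
node_b = LikeCounter("b")
node_a.like()
node_a.like()
node_b.like()
print(node_a.value(), node_b.value())  # 2 1  (nodes disagree for a while)

node_a.merge(node_b)
node_b.merge(node_a)
print(node_a.value(), node_b.value())  # 3 3  (converged after syncing)
```

Between syncs the two nodes report different counts, which is the "little delay" in the example. Once they exchange state, both settle on the same total.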
Causal Consistency:
- If one event depends on another, every node must show them in that cause-and-effect order; unrelated events may appear in any order
- Example: WhatsApp messages - you send "hi" followed by "hello", and the recipient sees them in that order
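A minimal sketch of how a receiver might enforce this ordering: if a message arrives before the message it depends on, it is held back until the dependency shows up. The `CausalInbox` class and its `depends_on` parameter are hypothetical, and real systems usually track dependencies with vector clocks or similar metadata rather than a single message id.

```python
class CausalInbox:
    """Delivers a message only after the message it depends on."""

    def __init__(self):
        self.delivered = []  # message ids in the order shown to the user
        self.seen = set()    # ids already delivered
        self.waiting = {}    # dependency id -> messages blocked on it

    def receive(self, msg_id, depends_on=None):
        if depends_on is not None and depends_on not in self.seen:
            # Dependency has not arrived yet: buffer instead of showing.
            self.waiting.setdefault(depends_on, []).append(msg_id)
            return
        self._deliver(msg_id)

    def _deliver(self, msg_id):
        self.delivered.append(msg_id)
        self.seen.add(msg_id)
        # Release anything that was waiting on this message.
        for blocked in self.waiting.pop(msg_id, []):
            self._deliver(blocked)


inbox = CausalInbox()
inbox.receive("hello", depends_on="hi")  # arrives first over the network
print(inbox.delivered)                   # []
inbox.receive("hi")
print(inbox.delivered)                   # ['hi', 'hello']
```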
Read-Your-Writes Consistency:
- After you update a value, your own later reads always show that update
- Example: LinkedIn profile update - you change your profile picture and immediately see the new one yourself, even if others see it slightly later
Monotonic Reads Consistency:
- Once a user has read a value, that user's later reads never return an older one. They see the same value or a newer one
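One common way to get this guarantee is for the client session to remember the newest version it has seen and to refuse answers from replicas that are behind it. The sketch below assumes each replica exposes a version number. `VersionedReplica`, `Session`, and the sample values are hypothetical names for illustration only.

```python
class VersionedReplica:
    """Toy replica holding one value tagged with a version number."""

    def __init__(self, version, value):
        self.version = version
        self.value = value


class Session:
    """Client session that never reads older data than it has already seen."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.last_seen_version = 0

    def read(self):
        for replica in self.replicas:
            # Skip replicas that lag behind what this session already saw.
            if replica.version >= self.last_seen_version:
                self.last_seen_version = replica.version
                return replica.value
        raise RuntimeError("no replica is fresh enough for this session")


fresh = VersionedReplica(version=2, value="new photo")
stale = VersionedReplica(version=1, value="old photo")

session = Session([fresh, stale])
print(session.read())  # new photo

# Suppose the next request happens to reach the stale replica first.
session.replicas = [stale, fresh]
print(session.read())  # new photo (stale replica is skipped)
```

A brand-new session would still be allowed to read from the stale replica, because the guarantee is per session: reads never go backwards, but they are not promised to be the latest.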
Monotonic Writes Consistency:
- Writes from the same client are applied on all nodes in the order they were issued
- Example: E-commerce order status "ordered → shipped → out for delivery → delivered"
Summary: Consistency models define how reads and writes behave: updates are visible immediately (strong), after a short delay (eventual), or in a guaranteed order (causal, monotonic)
Fault Tolerance
Definition: Some parts of the system may fail, but the overall system keeps working.
This is a design principle: systems are built so that when hardware, software, or the network fails, the system recovers gracefully.
Key Concepts
Redundancy: (refer above)
Replication: (refer above)
Failover Mechanism:
- If one system component fails, traffic automatically switches to a backup component
- Example: Primary DB down → system promotes the secondary DB and routes traffic to it
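The primary-to-secondary switch can be sketched as a small router that retries on the backup when the primary is unreachable. This is a simplified illustration: `Database`, `FailoverRouter`, and the `db-primary` and `db-secondary` names are hypothetical, and production failover also has to deal with health checks, replication lag, and avoiding two nodes acting as primary at once.

```python
class Database:
    """Toy database node that can be marked unhealthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def query(self, sql):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled: {sql}"


class FailoverRouter:
    """Sends queries to the primary and promotes the secondary on failure."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def query(self, sql):
        try:
            return self.primary.query(sql)
        except ConnectionError:
            # Promote the secondary; the failed node becomes the standby.
            self.primary, self.secondary = self.secondary, self.primary
            return self.primary.query(sql)


primary = Database("db-primary")
secondary = Database("db-secondary")
router = FailoverRouter(primary, secondary)

print(router.query("SELECT 1"))  # db-primary handled: SELECT 1
primary.healthy = False          # simulate an outage
print(router.query("SELECT 1"))  # db-secondary handled: SELECT 1
print(router.primary.name)       # db-secondary
```

If both nodes are down, the second attempt raises `ConnectionError` to the caller, which is the point where a real system would alert an operator.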
Graceful Degradation:
- The system doesn't crash completely but keeps running with limited features
- Example: Even if the Netflix recommendation engine fails, you can still stream videos
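In code, graceful degradation often comes down to catching the failure of a non-essential dependency and substituting a simpler fallback. The sketch below is loosely modelled on the streaming example. The function names, the `POPULAR_TITLES` list, and the response fields are all invented for illustration and do not describe how Netflix actually works.

```python
POPULAR_TITLES = ["Title A", "Title B", "Title C"]  # hypothetical static fallback


def personalized_recommendations(user_id):
    """Stand-in for a recommendation service that is currently failing."""
    raise TimeoutError("recommendation engine unavailable")


def home_page(user_id, recommend=personalized_recommendations):
    """Build the home page, falling back to popular titles on failure."""
    try:
        rows = recommend(user_id)
        degraded = False
    except Exception:
        # Non-essential feature failed: serve a generic list instead of erroring.
        rows = POPULAR_TITLES
        degraded = True
    return {
        "recommendations": rows,
        "streaming_available": True,  # core feature does not depend on recommendations
        "degraded": degraded,
    }


page = home_page(user_id=42)
print(page["recommendations"])      # ['Title A', 'Title B', 'Title C']
print(page["streaming_available"])  # True
print(page["degraded"])             # True
```

The `degraded` flag is a design choice in this sketch: it lets monitoring notice that users are getting the fallback even though no request failed.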
Real-World Scenarios
🏦 Banking Application
ATM Transactions:
- If one branch's ATM server goes down, transactions are automatically processed through the central server or another branch's server
Database Replication:
- Transaction data is written simultaneously to one primary database and one replica (backup) database. If the primary fails, the replica takes over
🎬 Netflix / YouTube
Content Delivery:
- If one content delivery server fails, the CDN automatically switches the user to the nearest healthy server
☁️ Cloud Systems (AWS, GCP, Azure)
Availability Zones:
- If one availability zone fails, the application automatically continues running on instances hosted in another availability zone