Distributed Systems

System Design Primer

More advanced: Microservice Design Patterns Course (external)

DNS Service

  • DNS resolves a URL's domain name to an IP address
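
A quick way to see this resolution in practice - a minimal sketch using Python's standard library (the hostname is just an example):

    import socket

    # Resolve a hostname (the host part of a URL) to an IPv4 address
    ip = socket.gethostbyname("example.com")  # example hostname
    print(ip)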

Database

  • NoSQL doesn't offer joins
  • Use NoSQL for unstructured data, data flows where you're only serializing and deserializing data, or storing massive amounts of data

Database Replication

  • It's common to have multiple copies of your db to enable failover

Master-slave replication

  • A master db will only support writes. Slave dbs will get copies of the data from the master and only support reads
    • You usually have more slaves than masters since reads are more common

Failures

  • If only one slave is available and it goes offline, reads will be temporarily directed to the master until a new slave can be found
  • If the master goes offline, a slave will be promoted to be the new master

Sharding

  • Horizontal scaling of databases
  • Each database is known as a shard. Instead of replicating the db, it's split up over multiple databases
  • We can use a basic hash function to decide which database a user goes to, e.g. user_id % num_shards (see the sketch below)
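
A minimal sketch of that routing rule (the shard count and user ids here are just placeholders):

    NUM_SHARDS = 4  # assumption: four database shards

    def shard_for_user(user_id: int) -> int:
        # Basic hash: route a user to a shard by modulo
        return user_id % NUM_SHARDS

    # user 123 -> shard 3, user 124 -> shard 0, ...
    print(shard_for_user(123), shard_for_user(124))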

Issues

  • Resharding - a single shard can no longer hold any more data, so the data must be redistributed with an updated shard function
  • Celebrity problem - e.g. multiple celebrities' Twitter accounts end up on the same shard, creating a hotspot
  • It makes it hard to perform joins

Vertical vs. Horizontal Scaling

  • Vertical if you have low traffic
  • Vertical scaling doesn't allow for failover and has a hard limit

Load Balancer

  • Lets you distribute traffic among servers
  • The load balancer communicates with the servers via a private IP

Cache

CDN

CDNs cache static content like CSS, JS files

Ex. a user in Hong Kong requests data from the CDN. It isn't there, so the CDN pulls it from S3, then caches the data for subsequent requests

Considerations

  • Set an expiration time (TTL) on cached content

Stateless Servers

  • It's recommended to keep user session/state data in a shared data store so any web server can handle any request

Geo Routing

  • You can split your whole web server operation between geographical regions

    Ex. you have copies of your web servers, dbs, and caches over different geographical regions

Considerations

  • You may need copies of your data in each region in case one data center goes out

Message Queues

Message queues support asynchronous communication: a producer publishes messages to the queue, and a consumer subscribes to the queue and consumes them
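
A minimal in-process sketch of the producer/consumer pattern using Python's standard library (a real system would use a broker such as RabbitMQ or Kafka):

    import queue
    import threading

    q = queue.Queue()

    def producer():
        for i in range(5):
            q.put(f"message {i}")   # publish a message
        q.put(None)                 # sentinel: no more messages

    def consumer():
        while True:
            msg = q.get()           # block until a message is available
            if msg is None:
                break
            print("consumed", msg)

    threading.Thread(target=producer).start()
    consumer()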

Back of the Envelope Estimation

Power of 2   Approximate value   Name
10           1 thousand          1 KB (Kilobyte)
20           1 million           1 MB (Megabyte)
30           1 billion           1 GB (Gigabyte)
40           1 trillion          1 TB (Terabyte)
50           1 quadrillion       1 PB (Petabyte)
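
A quick worked example of using these powers of two (the user count and 1 KB-per-user figure are assumptions for illustration):

    users = 100_000_000          # 100 million users (assumed)
    bytes_per_user = 1_000       # ~1 KB per user record (assumed)
    total = users * bytes_per_user
    print(total / 2**30, "GB")   # ~93 GB, so roughly 100 GB of storage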

  • L1 Cache - smallest and fastest, inside the CPU (per core)
  • L2 Cache - larger and slower, usually per core
  • L3 Cache - largest, shared between cores

Latency numbers - page 36

Framework for System Design Interview Questions

  • Don't over-engineer
  • Always keep the interviewer in the loop of what you're thinking
  1. Understand the problem and establish design scope

    • Don't give quick answers. Think through and fully understand the requirements
      • What features are we going to build?
      • How many users?
      • How fast does the company anticipate scaling?
      • What's the tech stack?
      • What are the most important features?
      • What's the traffic volume?
  2. Propose a high-level design and get their feedback

  3. Design deep dive

    Focus on bottlenecks. Some interviewers want you to focus on high-level design

  4. Wrap Up

    • Discuss potential improvements, give a recap

Design a Rate-Limiter

Controls the rate of traffic sent by a client or a service

Ex. limiting the number of accounts created from the same IP, or the number of writes per second

Client-side rate limiting is unreliable since requests can be forged by malicious actors

  • So we should enforce rate limits on the server side
  • The rate limiter should sit between the client and the servers and return HTTP 429 (Too Many Requests) when the limit is exceeded

API Gateways are managed services that provide rate-limiting

Rate Limiting

We can use Redis to store the counters, since it's fast and provides INCR (increment a counter) and EXPIRE (set a time-to-live on a key)
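
A minimal fixed-window counter sketch, assuming a local Redis instance and the redis-py client (the limit and window values are arbitrary):

    import redis

    r = redis.Redis()              # assumes Redis on localhost:6379
    LIMIT = 10                     # max requests per window (assumed)
    WINDOW = 60                    # window length in seconds (assumed)

    def allow_request(client_id: str) -> bool:
        key = f"rate:{client_id}"
        count = r.incr(key)        # INCR: atomically bump the counter
        if count == 1:
            r.expire(key, WINDOW)  # EXPIRE: reset the counter after the window
        return count <= LIMIT      # over the limit -> caller returns HTTP 429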

Distributed Rate-Limiter

  • If two requests concurrently read the counter before writing it back, both will write counter + 1 and one increment is lost (a race condition)
  • Or we may need multiple rate-limiter servers
  • We can have two clients with two rate limiters, both using a shared Redis store
  • Synchronize data with an eventual consistency model

Performance Optimization

  • Multi-data centers are crucial because latency will be high for users far geographically from the data center

Design Consistent Hashing

  • A technique to hash requests evenly across servers

    Basic technique: server_index = hash(key) % n, where n is the number of servers. This fails, however, when servers are added or removed, because most keys then map to a different server

  • Instead, let's picture a hash ring, where the hash space from 0 to 2^160 - 1 is connected in a circle

  • Now, we put our servers evenly spaced out on the ring

  • To determine which server a key goes to, we start at the key's hash position and go clockwise until a server is found

Adding a Server

  • Ex. add a new server between s0 (server 0) and s1; only the keys between s0 and the new server (which previously went to s1) need to be remapped to it

Problems with this Approach

  • It's impossible to keep the partitions (the hash space between adjacent servers) the same size as servers are added and removed
  • It's possible to have non-uniform key distribution - lots of data mapped to the same server

Virtual Nodes - a Better Approach

Virtual nodes - each server is represented by multiple virtual nodes on the ring. Because each server has many points on the ring, keys are spread more evenly

Partition - the hash space between adjacent virtual nodes on the ring; more virtual nodes means smaller, more evenly balanced partitions
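
A minimal sketch of a hash ring with virtual nodes (SHA-1 just to mirror the 2^160 hash space; the server names and replica count are assumptions):

    import bisect
    import hashlib

    def h(key: str) -> int:
        # Map a string onto the 0 .. 2^160 - 1 hash space
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, servers, vnodes=100):
            # Each server gets `vnodes` positions on the ring
            self.ring = sorted(
                (h(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
            )
            self.hashes = [pos for pos, _ in self.ring]

        def server_for(self, key: str) -> str:
            # Walk clockwise from the key's position to the next virtual node
            i = bisect.bisect(self.hashes, h(key)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["s0", "s1", "s2", "s3"])
    print(ring.server_for("user:123"))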

Design a Key-Value Store

Single-Server approach: Store key-value pairs in a hash table that keeps everything in RAM

Optimize by: storing the most-used data in RAM and the rest on disk

  • Data compression

CAP Theorem

A distributed data store can only guarantee two of Consistency, Availability, and Partition tolerance at the same time

To scale the store beyond a single server, we partition the data and need a consistent hashing algorithm to spread the traffic

  • We should spread our replicas over various data centers in different geographic regions

Consistency

We need to keep data in sync over various replicas

  • N - number of replicas
  • W - write quorum: a write must receive acknowledgements from W replicas before it's considered successful (W = 1 means the coordinator only needs confirmation from 1 node that the data was saved)
  • R - read quorum: the number of replica responses a read must wait for

  • Strong consistency - any read returns the most recent write
  • Weak consistency - reads may not see the most recent write
  • Eventual consistency - a form of weak consistency: given enough time, all replicas converge to the same value
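
A small worked example of tuning these values (N = 3 is an assumption; the key point is that when W + R > N, the read and write sets must overlap in at least one replica, which gives strong consistency):

    N = 3   # replicas per key
    W = 2   # a write waits for acknowledgements from 2 replicas
    R = 2   # a read waits for responses from 2 replicas

    # W + R > N -> the replicas a read contacts always include at least one
    # replica that acknowledged the latest write, so reads see the latest value
    print(W + R > N)   # True -> strongly consistent configuration
    # W = 1, R = 1 (W + R <= N) would be faster, but only eventually consistent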

If we have two writes on different servers that modify the same data:

  • we use a vector clock to determine which came first, or whether they conflict - it stores a (server id, version) counter for each server that has handled the data
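
A minimal sketch of the vector-clock comparison (clocks as dicts of server id to version; the server names are just examples):

    def descends_from(a: dict, b: dict) -> bool:
        # True if clock `a` has seen everything clock `b` has (a >= b)
        return all(a.get(server, 0) >= v for server, v in b.items())

    def compare(a: dict, b: dict) -> str:
        if descends_from(a, b):
            return "a is newer (or equal)"
        if descends_from(b, a):
            return "b is newer"
        return "conflict - must be reconciled (e.g. by the client)"

    v1 = {"s0": 2, "s1": 1}   # write handled by s0 twice, s1 once
    v2 = {"s0": 1, "s1": 2}   # concurrent write on a different path
    print(compare(v1, v2))    # -> conflict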

Handling Failures

  • If a server goes down

Detecting failures

  • If at least two servers report that a server is down, then we treat it as down (one report alone isn't trusted)

Gossip Protocol

  • Each node maintains a node membership list - contains member IDs (other nodes) and heartbeat counts
  • Each node periodically sends a heartbeat. If a heartbeat counter is lagging, the node is down
  • Once a node notices that another node's heartbeat has stopped increasing, it sends that node's info to a few random nodes, which confirm and propagate that the node is down
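
A minimal sketch of the membership list and the staleness check (the timeout value and node ids are assumptions):

    import time

    FAIL_AFTER = 10  # seconds without a heartbeat increase (assumed)

    # member id -> heartbeat counter and the last time we saw it increase
    membership = {
        "s1": {"heartbeat": 42, "last_update": time.time()},
        "s2": {"heartbeat": 17, "last_update": time.time() - 30},
    }

    def on_heartbeat(node_id: str, heartbeat: int):
        # Merge gossip: only a newer heartbeat counter updates the entry
        entry = membership.get(node_id)
        if entry is None or heartbeat > entry["heartbeat"]:
            membership[node_id] = {"heartbeat": heartbeat, "last_update": time.time()}

    def suspected_down():
        now = time.time()
        return [n for n, e in membership.items() if now - e["last_update"] > FAIL_AFTER]

    print(suspected_down())  # -> ['s2'] in this example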

Sloppy Quorum - Temporary Failures

  • A technique for high availability

    The system chooses the first W and R available servers for reads and writes

    If a server is down, another will temporarily process requests

    • When the server comes back, the temporary server hands that data back to it (hinted handoff)

Handling Permanent Failures

Compare the data on the replicas and update each replica to the newest version (anti-entropy); Merkle trees are commonly used so only the differing ranges need to be compared

Reads & Writes

Reads

  • Reads check the in-memory cache (memtable) first, then a Bloom filter (if not present in the cache) to determine which SSTables might hold the data

Writes

  • Writes are appended to the WAL (write-ahead log) first, then applied to the in-memory cache (memtable); when the memtable is full, it's flushed to disk as an SSTable
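
A minimal sketch of that write path (the flush threshold and file name are assumptions; a real store would use a proper on-disk format):

    import json

    MEMTABLE_LIMIT = 2          # flush after this many entries (tiny, for the demo)
    memtable = {}               # in-memory cache of recent writes
    sstables = []               # flushed, immutable tables (kept in memory here)

    def write(key, value):
        # 1. Append to the write-ahead log so the write survives a crash
        with open("wal.log", "a") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Apply to the in-memory cache (memtable)
        memtable[key] = value
        # 3. When the memtable is full, flush it as a sorted SSTable
        if len(memtable) >= MEMTABLE_LIMIT:
            sstables.append(dict(sorted(memtable.items())))
            memtable.clear()

    write("user:1", "alice")
    write("user:2", "bob")
    print(sstables)             # -> [{'user:1': 'alice', 'user:2': 'bob'}]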

Full Design

  • A coordinator node acts as a proxy between clients and the storage nodes, which are placed on a consistent-hashing ring
  • Nodes exchange heartbeats (gossip) to detect failures and keep membership information up to date

Goals - Techniques

Goal                         Technique
Big data                     Consistent hashing
High availability (reads)    Data replication
High availability (writes)   Vector clocks
Dataset partitioning         Consistent hashing
Tunable consistency          Quorum consensus