Title: System Design Interview
Author: Alex Xue
Chapter 2: Back-Of-The-Envelope Estimation
This chapter provides some useful data and metrics that could be used to make estimation for various aspect of a system
Chapter 5: Designing Consistent Hashing
Rehashing problem: most hashing function maps a key to an index using
serverIndex = hash(key) % N. However,
when value of
N change due to the more servers (scale up) or some servers failing, the value of
Solution: Ring hashing
- A hash function is used that provides a fixed range (ie SHA1:
[0, 2^160 -1]).
- Form a ring from the start of the hash range to the end of the hash range
- Each server will be assigned a position in the ring.
- For each key, find the position on the hash ring and the assigned server will the first server in the counter clockwise direction
- Adding or removing a server will not change the assigned server for unaffected keys.
- Keys might not be evenly distributed which will cause some server to be overloaded
- Removing adjacent servers could result in a large partition for a particular key.
Mitagtion: virtual nodes
- Instead of having each node mapped to a single position in the ring, each nodes will be mapped to multiple nodes in the ring. This will result in smaller partition which would reduce the additional load on a single node when a node is removed.
Chapter 6: Design a Key-Value Store
CAP theorem: distributed key value must choose two of consistency, availability and partition tolerance.
- Consistency: all clients see the same data at the same time
- Strong consistency: any read operation returns a value corresponding to the result of the most updated write data item
- Weak consistency: subsequent read operations may not see the most updated value.
- Eventual consistency: given enough time, all updates are propagated and all replicas are consistent
- Availability: any clients which request data gets a response even if some of the nodes are down
- Partition Tolerance: a partition indicates a communication break down between two nodes. The node is still operational but only cannot communicate with other nodes.
- Real world: consistency and availability does not exists as partition is a very real problem.
Master Salve (Naive):
- One writer node that have multiple replicated reader node
- Manages to achieve consistency and availability
- If one reader node goes down, other reader node can take over. If master node goes down, the reader node will be promoted to the master.
- As all data are propagated from the writer to reader, there will not be any inconsistent data.
- (klement: How do you enforce linearizability? Reading data that has been written but not yet propagated)
- Disadvantages (not partition tolerance):
- When one of the reader node is partitioned, the it serves might become stale (write operations not propagated)
- Consistency + Partition tolerance modification:
- If a nodes becomes partitioned, the non-partitioned node must block all write operations
- Prevents the data from the partitioned reader becoming stale (inconsistent)
- Disadvantages: unavailable as all write operations are stopped when a node is down
- Availability + Partition tolerance:
- Continue to allow read and write operations to be excuted -> inconsistent data between non-partitioned and partitioned node
- Eventually sync after the partitioned node is back
Data Partition: use consistent ring hash to partition data across multiple nodes with replication
- Essential for high availability: cannot have a single point of failure that could bring down the entire application
- Solution: instead of storing the key in the first node in the clockwise direction of ring hash, store the key in the first N nodes in the clockwise direction of the ring hash.
- Each key will have different set of replica nodes.
- Due to virtual nodes, the first N nodes should be unique physical nodes.
- Synchronizing data across replicas (not writer node)
- (klement: Are all write operations to the nodes the same? Shouldn’t they all be from the same master writer node?)
- Quorum consensus
- N = The number of replicas
- W = write quorum of size W. For a write operation to be considered as successful, write operation must
be acknowledge from
- R = A read quorum of size R. For a read operation to be considered as successful, read operation must wait for responses from at least R replicas
- Each set of replicas will have a single coordinator that acts as a proxy for all read and write operation to the replicas.
- N, W, R:
- R=1 W=N, the system is optimised for fast read
- W=1 R=N, the system is optimised for fast write
- W + R > N, strong consistency. There must at least be a node that saw both the read and write operation => strong consistency
- Replicas not accept new reads/write until all of the nodes agreee on the current write
- Could block new operations => not highly available
- W + R <= N, strong consistency not guaranteed
- Eventual consistency is recommend. Pass the burden of reconciling the eventual value to the client.
Inconsistency resolution: versioning
- Inconsistency happens when we choose non-strong consistency (eventual consistency) for data across replicas
- From the client POV, the data are inconsistent
- Version clock:
- Clock is represented by:
D([S1, v1], [S2,v2], ..., [Sn, vn]). The vector clock stores the version of the data from the POV of each of the node
[Si, vi]if exists
- Otherwise create a new entry
- Clock is represented by:
- Every time the client reads data, they get the data with the version clock
- Every time the client writes data, they will provide the previous clock
- The server will increment its clock value
- If the client gets conflicting data, will compare the vector clock of the two data.
- A vector clock is a parent (irrelevant) if it is pairwise less than or equal to the other data.
- If not comparable -> there is a conflict in data and the client can choose either of the sibling
- Failure detection:
- Naive: all-to-all multicast
- Gossip protocol:
- Each node periodically increments its heartbeat counter
- Each node periodically sends heartbeat to a set of random nodes
- Once nodes receive heart beat, membership (table of number of heartbeat) list is updated to the latest info
- If heart beat is not increased for more than predefined period
- Sends that offline node heart beat to other nodes to confirm
- Handling temporary failure:
- Sloppy Quorum: instead of choosing W and R random nodes, choose the first W and first R nodes. This allows for down nodes to be replaced by the next node in the ring
- Permanent failure:
- Handle replicas being out of synced
- Create a merkle tree for all the data in node and compare it with identical replicas.
- (klement: If a sliding window of replicas is chosen, wont each node have unique set of values and thus different merkle hash?)
(no master/salve model?)
- Coordinator that is a proxy and carries out the quorum algorithm
- Nodes are assigned to hash ring and data is replicated
- Every node has the same set of responsibility
- Write request is persisted on a commit log file
- Data is saved in the memory cache
- If cache is full flush data to disk
- If the key in memory return the data
- If data is not in memory, the system checks the bloom filter
- The bloom filter is used to figure out which SSTable might contain the key
- return the result to the client
- All write path and read path will start from the coordinator to all replicas for the key
- Carry out quorum/sloppy-quorum to determine when to return success/result to the client
- Utilizes gossip to check if the nodes are down