We'll continue from where we left off in the previous post.
In this post, I will cover the remaining system components: failure detection and handling, the system architecture diagram, and the write and read paths.
As with any large system at scale, failures are not only inevitable but common. Handling failure scenarios is very important. In this section, we first introduce techniques to detect failures. Then, we go over common failure resolution strategies.
Failure detection
In a distributed system, it is insufficient to believe that a server is down because another server says so. Usually, it requires at least two independent sources of information to mark a server down.
As shown in the following figure, all-to-all multicasting is a straightforward solution. However, this is inefficient when many servers are in the system.
A better solution is to use decentralized failure detection methods like gossip protocol. The gossip protocol works as follows:
- Each node maintains a node membership list, which contains member IDs and heartbeat counters.
- Each node periodically increments its own heartbeat counter.
- Each node periodically sends heartbeats to a set of random nodes, which in turn propagate them to another set of nodes.
- Once nodes receive heartbeats, the membership list is updated with the latest information.
- If the heartbeat has not increased for more than a predefined period, the member is considered offline.
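To make this concrete, here is a minimal sketch of gossip-style heartbeat tracking in Python. The fan-out, timeout value, and in-process method calls are illustrative assumptions; a real implementation would exchange these messages over the network.

```python
import random
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without an increase before a member is marked offline (illustrative)
GOSSIP_FANOUT = 3          # number of random peers to gossip with each round (illustrative)

class Node:
    def __init__(self, node_id, peer_ids):
        self.node_id = node_id
        # Membership list: member ID -> (heartbeat counter, last time the counter increased)
        self.members = {pid: (0, time.time()) for pid in peer_ids}
        self.members[node_id] = (0, time.time())

    def tick(self):
        # Periodically increment our own heartbeat counter.
        counter, _ = self.members[self.node_id]
        self.members[self.node_id] = (counter + 1, time.time())

    def gossip(self, nodes_by_id):
        # Send our membership list to a few random peers, which merge it into theirs.
        peers = [pid for pid in self.members if pid != self.node_id]
        for pid in random.sample(peers, min(GOSSIP_FANOUT, len(peers))):
            nodes_by_id[pid].merge(self.members)

    def merge(self, remote_members):
        # Keep the highest heartbeat counter seen for each member.
        for pid, (counter, _) in remote_members.items():
            local_counter, _ = self.members.get(pid, (-1, 0))
            if counter > local_counter:
                self.members[pid] = (counter, time.time())

    def offline_members(self):
        # A member whose counter has not increased within the timeout is considered offline.
        now = time.time()
        return [pid for pid, (_, seen_at) in self.members.items()
                if now - seen_at > HEARTBEAT_TIMEOUT and pid != self.node_id]
```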
This process is illustrated in the figure above.
Handling temporary failures
After failures have been detected through the gossip protocol, the system needs to deploy certain mechanisms to ensure availability. In the strict quorum approach, read and write operations could be blocked as illustrated in the quorum consensus section.
A technique called “sloppy quorum” is used to improve availability. Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and first R healthy servers for reads on the hash ring. Offline servers are ignored.
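As a rough illustration, the sketch below picks the first W (or R) healthy servers while walking the ring clockwise; the server names and the is_healthy check are made up for the example.

```python
# A minimal sketch of sloppy-quorum server selection, assuming the hash ring is
# given as a list of servers in clockwise order starting from the key's position.
def pick_healthy_servers(ring_servers, count, is_healthy):
    """Walk the ring clockwise and pick the first `count` healthy servers,
    skipping offline ones instead of blocking the request."""
    chosen = []
    for server in ring_servers:
        if is_healthy(server):
            chosen.append(server)
        if len(chosen) == count:
            break
    return chosen

# Example: with W = 3 and s2 down, the write set skips s2 and includes s4.
ring = ["s1", "s2", "s3", "s4", "s5"]
healthy = lambda s: s != "s2"
print(pick_healthy_servers(ring, 3, healthy))   # ['s1', 's3', 's4']
```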
If a server is unavailable due to network or server failures, another server will process requests temporarily. When the down server is up, changes will be pushed back to achieve data consistency.
This process is called hinted handoff. Since s2 is unavailable in the following figure, reads and writes will be handled by s3 temporarily. When s2 comes back online, s3 will hand the data back to s2.
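Here is a minimal sketch of hinted handoff using the node names from the figure; real systems persist hints durably and replay them asynchronously.

```python
# Illustrative only: a dictionary stands in for each node's storage.
class HintedHandoffNode:
    def __init__(self, name):
        self.name = name
        self.data = {}    # regular key-value data owned by this node
        self.hints = {}   # intended owner's name -> {key: value} written on its behalf

    def write_with_hint(self, key, value, intended_owner):
        # s3 temporarily accepts a write that s2 should have handled,
        # and records a hint so it can hand the data back later.
        self.data[key] = value
        self.hints.setdefault(intended_owner.name, {})[key] = value

    def handoff(self, recovered_node):
        # When s2 comes back online, s3 pushes the hinted writes back to it.
        for key, value in self.hints.pop(recovered_node.name, {}).items():
            recovered_node.data[key] = value

s2, s3 = HintedHandoffNode("s2"), HintedHandoffNode("s3")
s3.write_with_hint("user:42", "alice", intended_owner=s2)   # while s2 is down
s3.handoff(s2)                                              # after s2 recovers
print(s2.data)   # {'user:42': 'alice'}
```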
Handling permanent failures
Hinted handoff is used to handle temporary failures. What if a replica is permanently unavailable? To handle such a situation, we implement an anti-entropy protocol to keep replicas in sync.
Anti-entropy involves comparing each piece of data on replicas and updating each replica to the newest version. A Merkle tree is used for inconsistency detection and minimizing the amount of data transferred.
Quoted from Wikipedia: “A hash tree or Merkle tree is a tree in which every non-leaf node is labeled with the hash of the labels or values (in case of leaves) of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures”.
Assuming key space is from 1 to 12, the following steps show how to build a Merkle tree. Highlighted boxes indicate inconsistency.
Step 1: Divide key space into buckets (4 in our example) as shown in the following figure. A bucket is used as the root level node to maintain a limited depth of the tree.
Step 2: Once the buckets are created, hash each key in a bucket using a uniform hashing method.
Step 3: Create a single hash node per bucket.
Step 4: Build the tree upwards till root by calculating hashes of children.
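The sketch below walks through steps 1 to 4 for the 12-key example, assuming four buckets and SHA-256 as the uniform hashing method; both choices are illustrative.

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def build_merkle_tree(store: dict, key_space=range(1, 13), num_buckets=4):
    # Step 1: divide the key space into buckets.
    keys = sorted(key_space)
    size = len(keys) // num_buckets
    buckets = [keys[i * size:(i + 1) * size] for i in range(num_buckets)]

    # Steps 2-3: hash each key's value, then create one hash node per bucket.
    bucket_hashes = [
        h("".join(h(str(store.get(k, ""))) for k in bucket)) for bucket in buckets
    ]

    # Step 4: build the tree upwards to the root by hashing pairs of children.
    level = bucket_hashes
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0], bucket_hashes   # root hash and the leaf-level bucket hashes
```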
To compare two Merkle trees, start by comparing the root hashes. If root hashes match, both servers have the same data. If root hashes disagree, then the left child hashes are compared followed by right child hashes. You can traverse the tree to find which buckets are not synchronized and synchronize those buckets only.
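Continuing the sketch above, the comparison below returns the buckets that need to be synchronized. For brevity it compares the bucket-level hashes directly once the root hashes disagree, rather than walking down the intermediate nodes.

```python
def find_out_of_sync_buckets(replica_a: dict, replica_b: dict):
    root_a, buckets_a = build_merkle_tree(replica_a)
    root_b, buckets_b = build_merkle_tree(replica_b)
    if root_a == root_b:
        return []   # root hashes match, so the replicas hold the same data
    return [i for i, (a, b) in enumerate(zip(buckets_a, buckets_b)) if a != b]

replica_1 = {k: f"value-{k}" for k in range(1, 13)}
replica_2 = {**replica_1, 7: "stale-value"}              # key 7 differs
print(find_out_of_sync_buckets(replica_1, replica_2))    # [2] -- key 7 falls in the third bucket
```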
Using Merkle trees, the amount of data that needs to be synchronized is proportional to the differences between the two replicas, not the amount of data they contain. In real-world systems, the bucket size is quite big. For instance, a possible configuration is one million buckets per one billion keys, so each bucket only contains 1000 keys.
Handling data centre outage
Data centre outages could happen due to power outages, network outages, natural disasters, etc. To build a system capable of handling data centre outages, it is important to replicate data across multiple data centres. Even if a data centre is completely offline, users can still access data through the other data centres.
Now that we have discussed different technical considerations in designing a key-value store, we can shift our focus to the architecture diagram, shown in the following figure.
Main features of the architecture are listed as follows:
As the design is decentralized, each node performs many tasks as presented in the following figure.
The following figure explains what happens after a write request is directed to a specific node. Please note that the proposed designs for the write/read paths are primarily based on the architecture of Cassandra.
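As a rough sketch of such a Cassandra-style write path: the write is first persisted to a commit log, then saved in the memory cache, and the cache is flushed to an SSTable on disk once it is full. The class and threshold below are illustrative, not the actual Cassandra implementation.

```python
class WritePath:
    MEMTABLE_LIMIT = 4   # flush threshold, deliberately tiny for illustration

    def __init__(self):
        self.commit_log = []   # append-only log persisted before acknowledging a write
        self.memtable = {}     # in-memory cache of recent writes
        self.sstables = []     # immutable sorted files flushed from the memory cache

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. persist to the commit log
        self.memtable[key] = value             # 2. save in the memory cache
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.flush()                       # 3. flush to an SSTable when the cache is full

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
```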
After a read request is directed to a specific node, it first checks if data is in the memory cache. If so, the data is returned to the client as shown in the following figure.
If the data is not in memory, it will be retrieved from the disk instead. We need an efficient way to find out which SSTable contains the key. A Bloom filter is commonly used to solve this problem.
The read path is shown in the following figure when data is not in memory.
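The sketch below approximates that read path: check the memory cache, then consult a Bloom filter per SSTable to decide which files might contain the key, and finally read from a matching SSTable. The tiny Bloom filter here is purely illustrative; like any Bloom filter, it can return false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=128):
        self.size = size
        self.bits = [False] * size

    def _positions(self, key):
        # Derive three bit positions from a hash of the key (illustrative choice).
        digest = hashlib.sha256(str(key).encode()).digest()
        return [digest[i] % self.size for i in range(3)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def read(key, memtable, sstables, bloom_filters):
    # 1. Check the memory cache first.
    if key in memtable:
        return memtable[key]
    # 2. Use Bloom filters to figure out which SSTables might contain the key,
    #    checking the most recent SSTables first.
    for sstable, bloom in zip(reversed(sstables), reversed(bloom_filters)):
        if bloom.might_contain(key) and key in sstable:
            # 3. The SSTable returns the result, which goes back to the client.
            return sstable[key]
    return None
```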
These two posts cover many concepts and techniques. To refresh your memory, the following table summarizes the features and corresponding techniques used for a distributed key-value store.
For the previous post, check out https://fifo.im/p/5utedt21d9w0
Thanks for reading; I hope you have a nice day!