Building Our Own POSIX-Compatible Distributed File System with Netty
Let us build our own resilient distributed storage (file) system with support for storing and retrieving files in chunks, just like HDFS.
Such a distributed storage system can be used to build applications like Dropbox that store and serve large files, including video.
In the next article, we will extend this DFS for video-specific file storage, along the lines of YouTube or Netflix.
Requirements:
Non-Functional Requirements:
- Parallel retrievals: Large files will be split into multiple chunks, and client applications should retrieve these chunks in parallel using threads (see the sketch after this list).
- Probabilistic routing: To enable lookups without requiring excessive RAM, client requests will be routed probabilistically to the relevant storage nodes via Bloom filters.
- Interoperability: The DFS will use Google Protocol Buffers to serialize messages. This allows other applications to easily implement the wire format.
- Asynchronous Scalability: Non-blocking I/O should be used to ensure the DFS can scale to handle hundreds of active client connections concurrently.
- Fault tolerance: The system must detect and withstand two concurrent storage node failures and continue operating normally. It must also be able to recover corrupted files, i.e., support read repair.
- POSIX compatibility: Unlike many other DFSs, this DFS will be POSIX-compatible, meaning the file system can be mounted like any other disk on the host operating system.
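To make the parallel-retrieval requirement concrete, here is a minimal sketch of a client fetching chunks concurrently with a thread pool. The class and method names (ParallelChunkClient, fetchChunk) are assumptions for illustration, not part of the final design:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch: fetch all chunks of a file in parallel, then reassemble in order.
public class ParallelChunkClient {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Hypothetical single-chunk fetch; would talk to a storage node over the network.
    private byte[] fetchChunk(String file, int chunkIndex) {
        return new byte[0]; // placeholder
    }

    public List<byte[]> retrieve(String file, int totalChunks) throws Exception {
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int i = 0; i < totalChunks; i++) {
            final int idx = i;
            futures.add(pool.submit(() -> fetchChunk(file, idx)));
        }
        List<byte[]> chunks = new ArrayList<>();
        for (Future<byte[]> f : futures) {
            chunks.add(f.get()); // the futures list preserves chunk order
        }
        return chunks;
    }
}
```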
Why do we need a Distributed File System?
We could simply connect different storage systems over a network to build a Network File System (NFS):
But what if we need compute along with storage, i.e., the ability to do some processing within the server that holds the data?
Advantages of having compute and memory next to the storage:
- Use it for caching
- Use it for in-memory data processing
- It allows us to connect different computers across different networks
For a plain network file system, we could use NFS-Ganesha (a user-space NFS server used by GlusterFS).
NFS vs DFS
Reference: https://www.edureka.co/community/3136/difference-between-hdfs-and-nfs
Beyond fault tolerance, a DFS like HDFS supports multiple replicas of each file, which enables data recovery.
Because storage is cheap, we can afford to keep multiple replicas on different systems.
This eliminates (or at least eases) the common bottleneck of many clients accessing a single file: since each file has multiple replicas on different physical disks, read performance scales better than with NFS.
Storage Model
Block Storage
- Block storage chops data into blocks and stores them as separate pieces.
Why Block Storage?
- It decouples the data from the user's environment and spreads it across multiple environments that can better serve the data.
- Unlike file storage, block storage doesn't rely on a single path to the data, so blocks can be retrieved quickly.
Back-of-the-Envelope Estimation:
We assume our DFS will be read-heavy, with a read-to-write ratio of around 100:1.
Capacity Estimation
We assume the DFS is located in one geographical region, in a single data center.
The quantities to estimate:
- Requests per second
- Storage for x years, based on the size of one object (GB/TB/PB)
- Memory estimate for caching hot data (GB/TB/PB)
Network Bandwidth Estimation
- New data per second (write)
- Chunk retrievals per second (read)
- Incoming data/traffic (write)
- Outgoing data/traffic (read)
A sample calculation covering all of these follows below.
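As an illustration, here is one pass at the numbers. Every input figure below (10 million reads per day, an average object size of 10 MB, 3x replication, a 5-year horizon, and the 80/20 caching rule) is an assumption chosen for the example, not a measurement:
- Requests per second: 10M reads/day ÷ 86,400 s ≈ 116 reads/s; with the 100:1 ratio, writes ≈ 100K/day ≈ 1.2 writes/s.
- Storage: 100K writes/day × 10 MB = 1 TB/day ≈ 365 TB/year; over 5 years ≈ 1.8 PB, or ≈ 5.5 PB with 3x replication.
- Memory for cache: daily read volume is 10M × 10 MB = 100 TB; caching the hottest 20% needs ≈ 20 TB, spread across nodes.
- Bandwidth: incoming ≈ 1 TB/day ≈ 12 MB/s; outgoing ≈ 100 TB/day ≈ 1.2 GB/s.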
System Interface Definition
- What is expected from the system
- CLI
- APIs, authenticated with a developer key (user_dev_key); a sketch of the client-facing API follows below
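A minimal sketch of what the client-facing API might look like. The interface name DfsClient and its method signatures are assumptions for illustration:

```java
import java.io.InputStream;
import java.util.List;

// Hypothetical client-facing API: store and retrieve files by path.
public interface DfsClient {
    void put(String path, InputStream data) throws DfsException;  // store a file
    InputStream get(String path) throws DfsException;             // retrieve a file
    List<String> list(String directory) throws DfsException;      // list a directory
    void delete(String path) throws DfsException;                 // remove a file
}

// Hypothetical checked exception signalling DFS failures to the caller.
class DfsException extends Exception {
    public DfsException(String message) { super(message); }
}
```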
Defining the data models
- Storage, transportation, encryption (a wire-format sketch follows below)
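Since the requirements name Google Protocol Buffers as the serialization format, here is a minimal sketch of what the chunk-transport messages might look like. All message and field names are assumptions for illustration, not the final schema:

```proto
syntax = "proto3";

// Hypothetical wire format for storing and retrieving chunks.
message StoreChunkRequest {
  string file_name = 1;
  int32  chunk_id  = 2;
  bytes  data      = 3;
  bytes  checksum  = 4; // used to detect in-flight or on-disk corruption
}

message RetrieveChunkRequest {
  string file_name = 1;
  int32  chunk_id  = 2;
}

message RetrieveChunkResponse {
  bool  found    = 1;
  bytes data     = 2;
  bytes checksum = 3;
}
```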
High Level Design
In HDFS terms, our Controller plays the role of the NameNode, and our Storage Nodes play the role of the DataNodes.
System Components
- Controller
- Storage Node
- Client
Controller
The Controller is responsible for managing resources in the system, somewhat like an HDFS NameNode.
- When a new storage node joins the DFS, the first thing it does is contact the Controller.
At a minimum, the Controller contains the following data structures:
- A list of active storage nodes
- The file system tree, describing the directories in the file system but NOT the files
- A routing table for each directory in the file system tree, with one or more Bloom filters for probabilistic file lookups
Since routing is probabilistic, the Controller will not know exactly where files are stored, but it will be able to route requests to their correct destination with a low probability of false positives.
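A minimal sketch of the Controller's in-memory state under these requirements; the class and field names are assumptions, not the final design:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory state of the Controller.
public class ControllerState {
    Set<String> activeStorageNodes; // host:port of each live storage node
    DirectoryNode fileSystemRoot;   // directory tree only; no file entries

    // Hypothetical directory-tree node carrying per-directory routing state.
    static class DirectoryNode {
        String name;
        List<DirectoryNode> children;
        Map<String, BloomFilter> routingTable; // storage node -> its Bloom filter
    }

    // Placeholder for the Bloom filter type sketched later in this article.
    static class BloomFilter {}
}
```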
Storage Node
Storage nodes are responsible for storing and retrieving file chunks.
- When a chunk is stored, it will be checksummed so that on-disk corruption can be detected (see the sketch after this list).
- When a corrupted file is retrieved, it should be repaired by requesting a replica before fulfilling the client request.
- Metadata, such as checksums, should be stored alongside the files on disk.
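A minimal sketch of chunk checksumming using SHA-256 from the JDK; the on-disk layout (a .sha256 file written next to each chunk) is an assumption for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal sketch: write a chunk with its checksum, and verify it on read.
public class ChunkStore {
    private static byte[] sha256(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256").digest(data);
    }

    // Store the chunk and its checksum side by side on disk.
    public void storeChunk(Path chunkPath, byte[] data) throws Exception {
        Files.write(chunkPath, data);
        Files.write(chunkPath.resolveSibling(chunkPath.getFileName() + ".sha256"),
                    sha256(data));
    }

    // Returns the chunk if intact; null signals corruption (triggers read repair).
    public byte[] readChunk(Path chunkPath) throws Exception {
        byte[] data = Files.readAllBytes(chunkPath);
        byte[] expected = Files.readAllBytes(
                chunkPath.resolveSibling(chunkPath.getFileName() + ".sha256"));
        return MessageDigest.isEqual(sha256(data), expected) ? data : null;
    }
}
```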
Client
The client's only functionality is to store and retrieve files.
The client should be unaware of any operations being carried out at the Controller and the storage nodes.
Detailed Design
Why Netty?
Netty is an asynchronous, event-driven network application framework for the JVM. Its non-blocking I/O lets a small pool of event-loop threads serve hundreds of concurrent client connections, which directly supports our asynchronous-scalability requirement.
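As a minimal sketch, this is roughly what a storage node's Netty bootstrap could look like. The port number and handler wiring are assumptions; the real pipeline would add a Protocol Buffers codec and our chunk-message handlers:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Minimal sketch of a storage node's non-blocking server bootstrap.
public class StorageNodeServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // accepts new connections
        EventLoopGroup workers = new NioEventLoopGroup(); // serves many connections per thread
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // The real pipeline adds a protobuf codec + chunk handlers here.
                        ch.pipeline().addLast(new ChannelInboundHandlerAdapter());
                    }
                });
            bootstrap.bind(9090).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```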
CAP Theorem:
We will focus on Availability and Partition tolerance over Strong/Strict Consistency.
- Consistency: Our DFS will follow the eventual consistency model.
- Availability: We assume data will always be available irrespective of node failures, as long as at least 3 nodes are up.
- Partition tolerance: The system should continue operating despite network partitions between nodes.
Horizontal vs. Vertical Data Partitioning
Tradeoffs:
Bloom Filters
Our Bloom filters will use the Murmur3 hashing algorithm. MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup.
Why the MurmurHash algorithm?
When compared with other hash algorithms, MurmurHash takes the least average time to hash, even with a higher number of collisions. It also remains the fastest when hashing random UUIDs, compared with competitors like FNV-1, DJB2, or SuperFastHash.
More about the Murmur3 hash algorithm:
MurmurHash3 is preferred over MurmurHash2. MurmurHash2 computes two 32-bit results in parallel and mixes them at the end, which is fast but means its collision resistance is only as good as that of a 32-bit hash. MurmurHash3 offers a 128-bit variant, which provides far better collision resistance.
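A minimal Bloom filter sketch using the 128-bit Murmur3 implementation from Guava (Hashing.murmur3_128). Deriving the k bit positions from the two halves of a single 128-bit hash (the Kirsch-Mitzenmacher technique) is a design choice of this sketch; sizing the filter is left to the caller:

```java
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter sketch backed by Guava's Murmur3 (128-bit) hash.
public class BloomFilter {
    private static final HashFunction MURMUR = Hashing.murmur3_128();
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit index from two 64-bit halves of one 128-bit hash.
    private int index(long h1, long h2, int i) {
        return (int) Math.floorMod(h1 + (long) i * h2, (long) numBits);
    }

    public void put(String key) {
        byte[] hash = MURMUR.hashString(key, StandardCharsets.UTF_8).asBytes();
        long h1 = toLong(hash, 0), h2 = toLong(hash, 8);
        for (int i = 0; i < numHashes; i++) bits.set(index(h1, h2, i));
    }

    public boolean mightContain(String key) {
        byte[] hash = MURMUR.hashString(key, StandardCharsets.UTF_8).asBytes();
        long h1 = toLong(hash, 0), h2 = toLong(hash, 8);
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(h1, h2, i))) return false; // definitely absent
        return true; // present, or a false positive
    }

    private static long toLong(byte[] b, int off) {
        long v = 0;
        for (int i = 0; i < 8; i++) v = (v << 8) | (b[off + i] & 0xff);
        return v;
    }
}
```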
Why Protocol Buffers?
Protocol Buffers vs. Avro: both offer compact binary serialization with broad cross-language support. Protocol Buffers generates code from a schema fixed at compile time, giving a strict, well-documented wire format that other applications can easily implement, which matches our interoperability requirement. Avro instead ships the schema alongside the data and supports dynamic typing, which suits schema-evolution-heavy data pipelines more than a fixed wire protocol.
Identifying and Resolving the bottlenecks
Identifying Single Points of Failure
In this design, the Controller is the obvious single point of failure, much like the NameNode in HDFS: if it goes down, clients can no longer resolve where files live.
Resolving the bottlenecks
One common remedy, borrowed from HDFS's standby NameNode, is to replicate the Controller's metadata to a standby instance that can take over on failure.
Ayush Arora
Software Developer with years of experience designing, developing and implementing highly scalable systems.
Currently, learning to build large-scale Data Processing systems on the cloud (GCP & AWS) using Hadoop and Spark.