POSIX compatible Distributed File System with

Building a own POSIX compatible Distributed File System with Netty

Let us build our own resilient distributed storage (file) system with support for storing and retreiving files in chunks just like HDFS.

Such distributed storage systems can be used for developing applications like Dropbox that allows storing video files. In the next article, we will extends this DFS for video specific file storage like Youtube, Netflix.


    Functional Requirements:
    Non Functional Requirements:
    Extended Requirements:
  • Parallel retrievals: Large files will be split into multiple chunks. Client applications should retrieve these chunks in parallel using threads.
  • Probabilistic Routing: to enable lookups without requiring excessive RAM, client requests will be routed probabilistically to relevant storage nodes via bloom filters.
  • Interoperability: The DFS will use Google Protocol Buffers to serialize messages. This allows other applications to easily implement the wire format.
  • Asynchronous Scalability: Non-blocking I/O should be used to ensure the DFS can scale to handle hundreds of active client connections concurrently.
  • Fault tolerance: The system must be able to detect and withstand two concurrent storage node failures and continue operating normally. It will also be able to recover corrupted files.
  • POSIX Compatibility: Unlike many other DFS, this DFS will be POSIX-compatible, meaning that the file system can be mounted like any other disk on the host operating system.

Why do we need a Distributed File System?

We could have simply connected different storage systems with network to develop a Network File System: What if we need compute along with storage: To do some processing, within a server. Advantages of having a memory: Use it for caching Use it for in-memory data processing It allows to connect different computers on different networks. We could use NFS-Ganesha (used by GlusterFS) NFS vs DFS https://www.edureka.co/community/3136/difference-between-hdfs-and-nfs Other than fault tolerance, DFS like HDFS does support multiple replicas of files. Data Recovery. Because the storage is Cheap, we can afford to have multiple replicas on different systems. This eliminates (or eases) the common bottleneck of many clients accessing a single file. Since files have multiple replicas, on different physical disks, reading performance scales better than NFS. Ref

Storage Model

Block Storage
  • Block storage chops data into blocks and stores them as separate pieces.
  • Why Block Storage? It decouples the data from the user’s environment and spreads it across multiple environments that can better serve the data. Unlike file storage, block storage doesn’t rely on a single path, files can be retrieved quickly with block storage.

Back of the envolope estimation:

We assume our DFS will be read heavy. The read to write ratio will be around 100:1.
Capacity Estimattion
We assume the DFS to be location in one geographical region in a single data center.
    System Scale Estimation
  • Requests per second
  • asfa
    • Storage for x years based on size of 1 object (GB/TB/PB)
  • asfa
    • Memory Estimate (Caching for hot data) (GB/TB/PB)
  • asfa
    • Network Bandwith Estimation
  • New Data per second (Write): test
  • URL Clicks/Redirection( Read): test
  • Incoming Data/Traffic (Write): test
  • Outgoing Data/Traffic ( Read): test
  • Storage for x years (Write): test
  • Memory for Cache (Write): test
  • System Interface Definition

    - What is expected from the System - CLI - APIs : user_dev_key

    Defining the data models

    - Storage, Transportaion, Encrption

    High Level Design

    Datanode Namenode

    System Components

    • Controller
    • Storage Node
    • Client:
    The Controller is responsible for managing resources in the system, somewhat like an HDFS NameNode.
    • When a new storage node joins the DFS, the first thing it does is contact the Controller.
    Storage Node
    Storage nodes are responsible for storing and retrieving file chunks.
    • When a chunk is stored, it will be checksummed so on-disk corruption, it can be detected.
    The only functionality with the client is to store and retreive the files.

    The client should be unaware of any operations that are being carried out at the Controller and the storage node

    Bloom Filters

    Planning the Design

    Molestiae cupiditate inventore animi, maxime sapiente optio, illo est nemo veritatis repellat sunt doloribus nesciunt! Minima laborum magni reiciendis qui voluptate quisquam voluptatem soluta illo eum ullam incidunt rem assumenda eveniet eaque sequi deleniti tenetur dolore amet fugit perspiciatis ipsa, odit. Nesciunt dolor minima esse vero ut ea, repudiandae suscipit!

    #2. Creative WordPress Themes

    Temporibus ad error suscipit exercitationem hic molestiae totam obcaecati rerum, eius aut, in. Exercitationem atque quidem tempora maiores ex architecto voluptatum aut officia doloremque. Error dolore voluptas, omnis molestias odio dignissimos culpa ex earum nisi consequatur quos odit quasi repellat qui officiis reiciendis incidunt hic non? Debitis commodi aut, adipisci.

    Detailed Design

    Why Netty?

    CAP Theorem:

    We will focus on Availability and Partition tolerance over Strong/Strict Consistency.
    Copyright: Create own image
    • Consistency: Our DFS will follow Eventual Consistency model.
    • Availability: We assume that the data will be always available, irrespective of any number of node failure. But, there should be atleast 3 nodes availble.
    • Partition tolerance:

    Partitioning Tolerance:

    Horizontal Vs Verical Data Partitioning


    Bloom Filters Our bloom filters will use murmur3 hashing algorithm. MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup.
    Why MurmurHash algorithm.?
    When compared with other hash algorithms, the MurmurHash algorithm takes the least average time hashing, even with higher number of collisions. Also, when working with Random UUIDs, it still takes the least time, when compared with its competitors like FNV-1, DJB2 or SuperFastHash. More about Murmur3 hash algorithm. The MurmurHash 3 is preferred over MurmurHash 2 because even if MurmurHash 2 computes two 32-bit results in parallel and mixes them at the end, which is fast but means that collision resistance is only as good as a 32-bit hash. Murmurhash3 allows a 128-bit variant, which might be more along the lines of what he's looking for (the original post mentions SHA256).
    Why Protocol Buffers.?
    Protocol Buffers vs Avro

    Identifying and Resolving the bottlenecks

    Identifying Single Point of Faliure
    Resolving the bottlenecks
    Once completed, request Prof. Mathew for proof reading.
    Design Document here
    Image placeholder

    Ayush Arora

    Software Developer with years of experience designing, developing and implementing highly scalable systems.

    Currently, learning to build large-scale Data Processing systems on the cloud (GCP & AWS) using Hadoop and Spark.