Reverse Engineering GFS (Google’s File System)

This unfinished article has been sitting in my draft stories for some time. It is a collection of notes in Q&A format that tries to reverse engineer the Google File System. Since I never managed to finish it, I am publishing it as-is so the invested effort is not lost.

Basil A.
11 min read · Dec 25, 2023

Objective

Trying to reverse engineer the different components within Google’s File System (GFS) by going through the GFS White Paper (https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/035fc972c796d33122033a0614bc94cff1527999.pdf) and utilizing Prompt Engineering Tools (ChatGPT-4).

What is the heartbeat mechanism in GFS?

The Master Server exposes a heartbeat endpoint. In pseudo representation it looks like the following:

POST /heartbeat/{chunkserver}
{
  "availableLoad": xyz,
  "availableStorageCapacity": xyz,
  "stateOfStoredChunks": xyz,
  "allChunkIds": [.. << List of Chunk Ids >> ..]
}

Chunk Servers periodically send heartbeats to the master server. Each heartbeat also carries information about the chunkserver's available storage capacity, its current load, and the state of its stored chunks.

Notice that all the chunk ids are sent with each heartbeat, and the master server replies with the chunk ids that no longer exist in its metadata so that the chunk server can delete these orphaned chunks, as mentioned in the GFS white paper under the Garbage Collection section:

Quoted from GFS White Paper:
In a HeartBeat message regularly exchanged with the master,
each chunkserver reports a subset of the chunks it has, and
the master replies with the identity of all chunks that are no
longer present in the master’s metadata. The chunkserver
is free to delete its replicas of such chunks.
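A minimal sketch of how the master side of that reply might be computed, assuming the master keeps a set of expected chunk ids per chunkserver in its metadata (the function and field names below are this article's assumptions, not from the paper):

# Hypothetical master-side handler for a heartbeat (illustrative names only).
# knownChunks: dict mapping chunkserver hostname -> set of chunk ids the
# master believes that server should hold, derived from its file metadata.
def handle_heartbeat(knownChunks, chunkserver, heartbeat):
    reported = set(heartbeat["allChunkIds"])
    expected = knownChunks.get(chunkserver, set())

    # Chunks the server reported but the master no longer tracks are orphans;
    # the chunkserver is free to garbage-collect them.
    orphaned = reported - expected

    return {"chunksToDelete": sorted(orphaned)}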

How are these heartbeats processed by Master Server?

When a heartbeat is received, the Master Server updates its record of that chunkserver's operational availability. Think of it as a key/value store holding the chunk server hostname as the key and the operational availability status as the value. Example pseudo code:

{
  "chunk-server-1": "OKAY",
  "chunk-server-2": "OKAY",
  "chunk-server-3": "DOWN",
  "chunk-server-4": "OKAY",
  "chunk-server-5": "DOWN"
}

What happens when the heartbeat from a Chunk Server is no longer received by Master Server?

The Master Server waits a specific amount of time for each heartbeat. If none arrives within that window, it considers the Chunk Server down and starts re-replicating all the chunks that server held, using the corresponding chunk replicas on other servers as sources.
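A rough sketch of that liveness check, assuming the master records the last heartbeat time per chunkserver (the timeout value and names here are assumptions, not from the paper):

import time

HEARTBEAT_TIMEOUT_SECONDS = 60  # assumed value, not specified in the paper

# lastHeartbeat: dict mapping chunkserver hostname -> timestamp of its last heartbeat.
def find_dead_chunkservers(lastHeartbeat, now=None):
    now = now if now is not None else time.time()
    return [
        server
        for server, last_seen in lastHeartbeat.items()
        if now - last_seen > HEARTBEAT_TIMEOUT_SECONDS
    ]

# For every server returned here the master would kick off re-replication
# of the chunks it hosted (see the failure-handling steps later in this article).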

Why does GFS make Chunk Servers push heartbeats instead of the Master Server polling them?

In the Google File System (GFS), having chunk servers initiate heartbeats instead of the master server polling each server significantly reduces network and processing overhead. This approach minimizes the number of active connections the master needs to manage, distributes the network load across all chunk servers, and conserves bandwidth, as only the heartbeat messages are transmitted, rather than continuous polling requests. This design is key to GFS’s scalability, enabling it to handle thousands of servers without overburdening the master server with network traffic and connection management tasks.

How are files stored in GFS?

Since this is not disclosed, I'll try to deconstruct how the Master Server might store file information using data structures:

# Storing the file-to-chunk mapping (a file is composed of an ordered list
# of chunks).
# The key here is the file name, and the value is an ordered list of
# chunk UUIDs.
fileToChunks = {
  "myfile.txt": ["chunk_8923", "chunk_8932", "chunk_9203", "chunk_1892"],
  "somefile.jpg": ["chunk_1234", "chunk_2323", "chunk_2389", "chunk_2390"],
  "another.doc": ["chunk_2389", "chunk_2378", "chunk_3287"]
}

# Storing chunk locations. The chunk locations are a key/value map
# where the key is the chunk UUID and the value contains a primary server
# name and a list of replica server names, since each chunk has three copies.
chunkLocations = {
  "chunk_1234": {
    "primary": "host_6", "replicas": ["host_3", "host_5"]
  },
  "chunk_5678": {
    "primary": "host_12", "replicas": ["host_98", "host_32"]
  }
}

# Storing the file hierarchy of folders and files.
# A Trie (prefix tree) or a B-Tree keyed by full pathname could hold this
# namespace; a minimal sketch follows.
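Purely as an illustration (the paths and field names are made up), the simplest version is a flat lookup table keyed by full pathname, which is also how the GFS paper describes its namespace rather than a per-directory tree:

# Illustrative namespace sketch: a flat lookup table mapping full pathnames
# to per-file metadata.
namespace = {
    "/home/user/myfile.txt":   {"chunkHandles": ["chunk_8923", "chunk_8932"]},
    "/home/user/somefile.jpg": {"chunkHandles": ["chunk_1234", "chunk_2323"]},
    "/logs/another.doc":       {"chunkHandles": ["chunk_2389", "chunk_2378"]},
}

# Listing everything under a "directory" then becomes a prefix scan over the
# keys, which is why a Trie keyed on path components is a natural fit.
def list_directory(namespace, prefix):
    prefix = prefix.rstrip("/") + "/"
    return [path for path in namespace if path.startswith(prefix)]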

How Does GFS Master Server Store Chunk Location Information?

Chunk location information is not stored persistently by the master server. Instead, when the master server starts it asks all chunk servers which chunks they hold, and it repeats the same query whenever a chunk server joins the cluster.

We can assume that each chunk server has a GET endpoint like this that will be contacted by the Master Server:

# A pseudo endpoint like this is exposed by each chunk server
GET /getAllChunkIds

# Sample response
[
  "<64-bit hex chunk id 1>",
  "<64-bit hex chunk id 2>",
  "<64-bit hex chunk id 3>",
  ... <all remaining chunks> ...
  "<64-bit hex chunk id 928>"
]
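A hedged sketch of how the master might rebuild an in-memory map from chunk id to holding servers out of those responses at startup (the helper name fetch_all_chunk_ids is an assumption standing in for a call to the pseudo GET /getAllChunkIds endpoint above):

# Illustrative only: rebuild the in-memory chunk-location map at master startup
# by asking every chunkserver which chunk ids it holds.
def rebuild_chunk_locations(chunkservers, fetch_all_chunk_ids):
    chunkLocations = {}
    for server in chunkservers:
        for chunk_id in fetch_all_chunk_ids(server):
            chunkLocations.setdefault(chunk_id, []).append(server)
    return chunkLocations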

Why is in-memory metadata storage a good fit for GFS?

TODO: Discuss this point, quoted from the GFS White Paper:

We initially attempted to keep chunk location information
persistently at the master, but we decided that it was much
simpler to request the data from chunkservers at startup,
and periodically thereafter. This eliminated the problem of
keeping the master and chunkservers in sync as chunkservers
join and leave the cluster, change names, fail, restart, and
so on. In a cluster with hundreds of servers, these events
happen all too often.
Another way to understand this design decision is to realize
that a chunkserver has the final word over what chunks
it does or does not have on its own disks. There is no point
in trying to maintain a consistent view of this information
on the master because errors on a chunkserver may cause
chunks to vanish spontaneously (e.g., a disk may go bad
and be disabled) or an operator may rename a chunkserver.

What happens when a Chunk Server fails (dies)?

When a chunk server fails, the following will take place:

  1. The master server notices, since it stops receiving the heartbeat.
  2. The master server identifies all chunks that are now under-replicated (possibly by looking up all chunks located on this chunk server in its metadata).
  3. For each such chunk, the master server instructs one of the chunk servers still holding a replica to push its copy to a healthy destination chunk server.
  4. The master server monitors that the chunk replication to the healthy server completes.
  5. If the failed chunk server later comes back online, its chunks are not reused if they are out of date or have already been re-replicated successfully on other chunk servers. (A sketch of steps 2 and 3 follows this list.)
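A simplified sketch of steps 2 and 3, assuming a flattened map from chunk id to the list of servers holding it (like the one rebuilt at startup in the earlier sketch) and a target of three replicas per chunk; all names here are illustrative:

REPLICATION_FACTOR = 3  # GFS keeps three replicas per chunk by default

# Illustrative re-replication planning when `dead_server` stops heartbeating.
# chunkLocations: dict mapping chunk id -> list of chunkservers holding a replica.
def plan_re_replication(chunkLocations, dead_server, healthy_servers):
    plan = []
    for chunk_id, holders in chunkLocations.items():
        live_holders = [s for s in holders if s != dead_server]
        if dead_server in holders and len(live_holders) < REPLICATION_FACTOR:
            source = live_holders[0]  # any surviving replica can push its copy
            candidates = [s for s in healthy_servers if s not in live_holders]
            if not candidates:
                continue  # no healthy destination available; retry later
            destination = candidates[0]
            plan.append((chunk_id, source, destination))  # "source pushes to destination"
    return plan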

Can a chunk be modified?

No, chunks are immutable; they can't be modified, but they can be appended to. The ability to append also aids in replicating chunks, since each chunk is 64 MB in size.

If new data is written to a file, new chunks get generated and metadata gets updated for that file.
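A toy sketch of that metadata update on append, reusing the fileToChunks map from earlier plus a chunkSizes dict and an allocate_chunk helper, all of which are this article's assumptions rather than the paper's API:

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

# Illustrative append: fill the remaining space in the file's last chunk,
# then allocate new chunks for whatever is left and record them in metadata.
def append_to_file(fileToChunks, chunkSizes, file_name, data_len, allocate_chunk):
    chunks = fileToChunks.setdefault(file_name, [])
    remaining = data_len
    if chunks:
        free = CHUNK_SIZE - chunkSizes[chunks[-1]]
        used = min(free, remaining)
        chunkSizes[chunks[-1]] += used
        remaining -= used
    while remaining > 0:
        new_chunk = allocate_chunk()          # e.g. a fresh 64-bit chunk handle
        written = min(CHUNK_SIZE, remaining)
        chunks.append(new_chunk)
        chunkSizes[new_chunk] = written
        remaining -= written
    return chunks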

Why did Colossus (Google’s GFS successor) decrease the chunk size compared to GFS?

Google has never publicly explained this decision, but ChatGPT offers a reasonable answer:

Colossus, the successor to the Google File System (GFS), made several improvements and changes to the original GFS design, including a reduction in chunk size. While Google has not publicly detailed all the reasons for this change, we can infer several likely motivations based on the general principles of distributed file systems:

Improved Storage Utilization: Smaller chunk sizes can lead to better storage utilization, especially for smaller files. In GFS, the large 64 MB chunk size could result in significant wasted space when storing many small files, as each file would consume at least one chunk. Reducing the chunk size helps mitigate this issue.

Faster Recovery from Failures: With smaller chunks, the system can recover more quickly from failures. In GFS, if a chunk server failed, each 64 MB chunk it hosted needed to be replicated from other servers, which could be time-consuming. Smaller chunks reduce the amount of data that needs to be transferred for replication, speeding up recovery times.

Load Balancing: Smaller chunks allow for more granular load balancing across the servers. It’s easier to distribute a larger number of smaller chunks evenly across servers than a smaller number of larger chunks, leading to more efficient utilization of resources.

Improved Performance for Small File Access: Smaller chunk sizes can enhance performance when dealing with small files, as reading or writing small files in a system with large chunks can be inefficient.

Handling Hotspots: In GFS, popular files (or chunks) could become hotspots, creating performance bottlenecks. Smaller chunks might help mitigate this problem by spreading the load more evenly across the system.

Adaptation to Changing Workloads: Google’s workloads and storage needs have evolved since GFS was first designed. Smaller chunk sizes in Colossus might reflect an adaptation to these changing requirements, such as an increase in the number of smaller files or different access patterns.

Enhanced Flexibility and Efficiency: Smaller chunks provide greater flexibility and can be more efficient in a system that is designed to store a vast and varied mix of file sizes, from very small to very large.

In summary, while the specific reasons for Colossus’s reduction in chunk size are not detailed in publicly available Google documentation, it is likely that this decision was motivated by factors such as improved storage utilization, faster recovery from failures, better load balancing, enhanced performance for small files, mitigation of hotspots, adaptation to evolving workloads, and overall system efficiency and flexibility.

From the explanation above, the most interesting points for me are how smaller chunks handle hotspots better and how they simplify load balancing, since it is easier to distribute a large number of small chunks across servers than a smaller number of large replicated chunks.

Which protocol does Colossus or GFS use to have servers communicate between each other?

This has not been disclosed, but Colossus probably uses gRPC or a variant of it. GFS likely used a custom protocol, since it was built before gRPC existed.

How does GFS Deal With Hotspot Chunks?

Hotspots usually occur for small files that have only one or a few chunks. The GFS white paper mentions a case where an executable stored in GFS was requested at the same time by thousands of clients, which overloaded the three chunk servers holding that chunk and its two replicas. The solution in such cases is to configure a higher replication factor (more than three replicas) for small, hot files, or perhaps to let the client request a higher replication setting if it knows beforehand that the file will be requested heavily. Google also mitigated the issue with a staggered start so that clients did not all request the file at the same time.
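A tiny sketch of the first mitigation, assuming the master stores a per-file replication factor that can be raised above the default of three (the variable and path names are made up for illustration):

DEFAULT_REPLICATION = 3

# Illustrative per-file replication override for files known to be hot,
# e.g. a small executable fetched by thousands of clients at once.
fileReplication = {
    "/apps/batch-job-binary": 10,   # hot file: keep many more replicas
}

def replication_factor(file_name):
    return fileReplication.get(file_name, DEFAULT_REPLICATION)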

How Does a Read Operation Take Place?

The following steps take place for a read:

  1. The client calls a (pseudo) method named read_file(file_name, byte_offset), and the function internally calculates which chunk index should be retrieved. Knowing that the chunk size is a fixed 64 MB, it computes the chunk index as byte_offset // (64 * 1024 * 1024), i.e. integer division of the byte offset by the chunk size. (A consolidated client-side sketch follows these steps.)
  2. The client then sends the Master Server a request containing the filename and chunk index, and the master server will return the unique chunk id (also called chunk handle) and the three chunk servers holding the chunk replicas. I assume the master server will have an endpoint like:
# Endpoint to get the 64-bit chunk handle and the chunk servers holding it
GET /getChunkLocation?fileName=$1&chunkIndex=$2

# Example request / response
Example Request:
GET /getChunkLocation?fileName=mylargefile.mpg&chunkIndex=5

Example Response:
{
  "chunkHandle": "021ae7ceadda5cbf",
  "hostLocations": [
    "chunkserver-1.gfs.com",
    "chunkserver-2.gfs.com",
    "chunkserver-3.gfs.com"
  ]
}

3. The client caches this metadata and sends a request to one of the replicas (usually based on criteria like proximity and load of the chunk server). The request specifies the chunk id and a byte range within that chunk. If the chunk server fails, the client tries another chunk server. The endpoint exposed from the chunk server will look like this:

# Chunk Server Endpoint
GET /readChunkData?chunkId=$1&byteRangeStart=$2&byteRangeEnd=$3

# Example response:
Content-Type: application/octet-stream
<<Binary Data Returned>>

4. Further reads of the same chunk require no client-to-Master-Server communication, since the client has already cached the chunk location, unless the cache expires or the file is reopened.

5. For the next chunk the client repeats the above steps to get information about chunk location.
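Putting those steps together, a hedged client-side sketch might look like the following. The endpoints mirror the pseudo APIs assumed above, and call_master / call_chunkserver are stand-ins for the actual HTTP/RPC calls, not a published GFS client library:

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

# Illustrative client-side read, mirroring steps 1-5 above.
chunk_location_cache = {}

def read_file(file_name, byte_offset, length, call_master, call_chunkserver):
    chunk_index = byte_offset // CHUNK_SIZE      # which chunk holds the offset
    offset_in_chunk = byte_offset % CHUNK_SIZE   # where to start inside that chunk

    # Step 2: ask the master (or the local cache) for the chunk handle and replicas.
    key = (file_name, chunk_index)
    if key not in chunk_location_cache:
        chunk_location_cache[key] = call_master(
            "/getChunkLocation", fileName=file_name, chunkIndex=chunk_index)
    location = chunk_location_cache[key]

    # Step 3: read from one replica, falling back to the next one on failure.
    for host in location["hostLocations"]:
        try:
            return call_chunkserver(
                host, "/readChunkData",
                chunkId=location["chunkHandle"],
                byteRangeStart=offset_in_chunk,
                byteRangeEnd=offset_in_chunk + length - 1)
        except ConnectionError:
            continue  # try another replica
    raise IOError("all replicas failed for chunk %s" % location["chunkHandle"])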

How Does The Write Operation Work in GFS?

Here are the steps taken for a write operation:

  1. A client performs a write by calling a (pseudo) method write_file(fileName, byte[] data).
  2. The client contacts the GFS master server to ask which chunk servers (one primary and two secondaries) it should write the chunk to. Here we imagine that the master server has this endpoint:
# Master server endpoint
POST /writeFile?fileName=$1&data=$2

# Response from master server to client
{
  "primaryChunkServerToWriteTo": "chunk-19.gfs.com",
  "secondaryChunkServersToWriteTo": ["chunk-6.gfs.com", "chunk-9.gfs.com"]
}

3. The client pushes the data to all replicas (primary and secondaries) in parallel, and each chunk server stores the data in a buffer in preparation for writing it to a file.

4. Once all replicas acknowledge back to the client that they have received the data, the client then sends a write request to the Primary Chunk Server.

5. The primary assigns a consecutive serial number to the operation to ensure all mutations are applied in a consistent order across all replicas.

6. The primary forwards the write command to all secondary replicas in the same order, where each replica applies the write to its copy of the chunk.

7. Secondaries acknowledge the completion of the write back to the primary.

8. Once the primary has received acknowledgement from all secondary replicas, it responds back to the client confirming that the write was successful.

Note: GFS has a relaxed consistency model. After the completion of a write, data may not be immediately visible to subsequent read operations due to buffering and delays in propagation.

Notice that data flows from the client directly to all three chunk servers, but coordination of the commit is handled only by the primary chunk server.

Error Handling: If any part of this process fails (e.g., a replica does not acknowledge the data transfer), the client retries the operation, possibly with different replicas as directed by the master server.
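Putting the whole flow together, a hedged client-side sketch of steps 1 through 8, using the same illustrative endpoint names as the rest of this article (call_master / call_chunkserver are stand-ins for the RPC calls, and chunk id handling is omitted for brevity):

# Illustrative client-side write, mirroring steps 1-8 above.
def write_file(file_name, data, call_master, call_chunkserver):
    # Steps 1-2: ask the master where to write.
    placement = call_master("/writeFile", fileName=file_name)
    primary = placement["primaryChunkServerToWriteTo"]
    secondaries = placement["secondaryChunkServersToWriteTo"]

    # Step 3: push the data to every replica; each buffers it without committing.
    for host in [primary] + secondaries:
        call_chunkserver(host, "/writeChunkData", data=data)

    # Steps 4-8: ask the primary to commit; the primary orders the mutation,
    # forwards it to the secondaries, and answers once they all acknowledge.
    result = call_chunkserver(primary, "/primaryCommitWrite",
                              coordinateWith=secondaries)
    if result.get("result") != "OK":
        raise IOError("write failed; the client should retry")
    return result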

How Does Writing a New File Larger Than 64 MB Work?

The client simply repeats the steps described above for every chunk. Each iteration starts with a request to the master server to allocate a new chunk id and to pick three chunk servers to write to. By the way, this chunk id and the filename are stored in the master server's metadata but marked as "uncommitted" or "tentative", and they are not made visible until the write fully completes.

I would imagine that on each interaction a part number is passed to the master server when allocating a new chunk id, to tell it which part of the file this chunk represents. See the next question.

Can you describe how the endpoint APIs for the write interaction would look like?

For the chunk id allocation step, the Master Server would expose an endpoint such as /allocateChunkToWrite:

# Master Server endpoint
POST /allocateChunkToWrite?fileName=$1&part=$2

# Response (the allocated chunk id is assumed to be returned here so the
# client can reference it in the write calls that follow)
{
  "allocatedChunkId": "<64-bit hex chunk id>",
  "primaryChunkServer": "primary host",
  "secondaryChunkServers": [..list of secondary chunk servers..]
}

In the next step the client sends the data to the chunk servers in parallel. We expect each chunk server to expose an endpoint such as POST /writeChunkData, which writes the data into a buffered zone on the chunk server (until coordination completes):

# Chunkserver endpoint
POST /writeChunkData?chunkId=$1&chunkData=$2

# Response returned to the client, confirming the write landed in the buffered
# zone within the chunk server
{
  "result": "OK"
}

In the third step the client sends a write request to the Primary Chunk Server only, on an endpoint like /primaryCommitWrite:

# Chunk Server endpoint
POST /primaryCommitWrite?chunkId=$1&coordinateWith=[<replicas>]

In the fourth step, the primary chunk server sends a coordination request to each replica to commit the write, and counts the commits completed on the replicas:

# (Replica) Chunk Server endpoint for coordination, contacted by the Primary
# Chunk Server
POST /secondaryCommitWrite?chunkId=$1

# Response back to the Primary Chunk Server
{
  "result": "OK"
}

In the fifth step, once the primary receives successful commits from all secondary chunk servers, it responds back to the client within the client-initiated request to /primaryCommitWrite.

Finally, the above is repeated for each chunk of the file. Once all chunks of the file are ready, the client informs the master server to change the metadata state from "uncommitted" to "committed" so that the metadata becomes visible to other clients. (This final step is just speculation about what takes place at the end of the write.)
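To make that final speculation concrete, here is a minimal sketch of how the master might track the tentative-versus-committed state of a file's metadata; the structure and function names are entirely this article's assumptions:

# Illustrative master-side metadata with a visibility flag per file.
fileMetadata = {}

def allocate_tentative_file(file_name):
    fileMetadata[file_name] = {"chunks": [], "state": "UNCOMMITTED"}

def record_allocated_chunk(file_name, chunk_id):
    fileMetadata[file_name]["chunks"].append(chunk_id)

def commit_file(file_name):
    # Called once every chunk of the file has been written and acknowledged;
    # only now does the file become visible to other clients.
    fileMetadata[file_name]["state"] = "COMMITTED"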

How Does Deletion of a File Work in GFS?

<TODO>

What is a “Chunk Handle” unique id? How are they generated, used, and stored?

How would the raw files representing the chunks be named?

How are the raw chunk files placed in the hierarchical file system?
