The Need:
- Component failures normal
Strong assumption: failures are normal. They can be hardware (switches, the network, etc.), software (the OS, applications, etc.), and human or non-human in origin.
- Files are huge
Multi-TB data sets are considered common, and individual files are themselves large (MB-scale and above).
- Most mutations are appends
A mutation is a change to file contents (or metadata); in the file system these reduce to writes and record appends, and random access is rare.
- Co-Designing apps & file system
Co-designing the applications and the file-system API increases the flexibility of the whole system. For example, relaxing GFS's consistency model relieves applications of the strict guarantees a traditional file system must provide, greatly simplifying the design of GFS. The atomic record-append operation lets multiple clients append to a file concurrently, without extra synchronization to keep the data consistent.
- Typical: 1000 nodes & 300TB
Desiderata
- Must monitor & recover from comp failures
- Modest number of large files
- Workload
-Large streaming reads + small random reads
-Many large sequential writes
- Random access overwrites don’t need to be efficient
- Need semantics from concurrent appends
- High sustained bandwidth
-More important than low latency
If every client insisted on having its operation executed immediately, none could make progress; GFS therefore imposes an ordering mechanism, and waiting clients must tolerate the added latency.
Interface
- Familiar
-Create, delete, open, close, read, write
- Novel
-Snapshot: a copy of a directory; not exactly the same as a full byte copy, and it works at the application level
Low cost
-Record append: unlike in conventional file systems, this is an atomic operation (an in-progress append is not affected by other concurrent operations, and no extra synchronization mechanism is needed)
Atomicity with multiple concurrent writes
Architecture
Key point: how the client and the master exchange information
Step 1: the client contacts the master; what they exchange is metadata (directory locations and the like, not the data itself)
Step 2: the client fetches the actual data from the appropriate chunkserver
Chunk
- Store all files
–In fixed-size chunks
• 64 MB; the larger the chunk, the lighter the load on the master
• 64 bit unique handle
- Triple redundancy
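Fixed-size chunks make client-side addressing trivial: a byte offset in a file maps directly to a chunk index plus an offset within that chunk. A minimal sketch (the function name is illustrative, not part of the GFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def locate(byte_offset: int) -> tuple[int, int]:
    """Map a file byte offset to (chunk index, offset within that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# A read at byte 200 MB falls 8 MB into the file's 4th chunk (index 3).
print(locate(200 * 1024 * 1024))
```

The client asks the master for the handle of that chunk index and then reads from a chunkserver, so the master never sees file offsets at all.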
Master
- Stores all metadata
–Namespace: the logical name space for files and for chunks
–Access-control information, plus the mapping tables that connect the names
–Chunk locations: the physical addresses of each chunk, including its replicas
–‘Lease’ management: the mechanism that coordinates synchronization and writes
- Heartbeats: periodic checks on the chunkservers
- Having one master —> global knowledge
–Allows better placement / replication
–Simplifies design
Client
- GFS code implements API
- Cache only metadata: clients cache metadata only, which saves time on repeated accesses; the actual data is too large to be worth caching

Metadata
- Master stores three types
–File & chunk namespaces
–Mapping from files —> chunks
–Location of chunk replicas
- Stored in memory
- Kept persistent thru logging
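The three metadata types can be pictured as two persistent maps plus one soft-state map that is rebuilt from chunkserver reports rather than logged. A sketch with illustrative names (not actual GFS data structures):

```python
# Persistent (kept in memory, made durable via the operation log):
namespace = {"/foo/bar": {"owner": "alice"}}      # file & chunk namespaces
file_to_chunks = {"/foo/bar": ["h1", "h2"]}       # file -> ordered chunk handles

# Soft state (NOT logged): rebuilt by asking chunkservers what they hold.
chunk_locations = {"h1": ["cs1", "cs2", "cs3"],
                   "h2": ["cs1", "cs3", "cs4"]}   # handle -> replica servers

def resolve(path: str, chunk_index: int) -> tuple[str, list[str]]:
    """Master-side lookup: return (chunk handle, replica locations)."""
    handle = file_to_chunks[path][chunk_index]
    return handle, chunk_locations[handle]

print(resolve("/foo/bar", 1))
```

Keeping locations as soft state is a design choice: the chunkservers are the authority on what they hold, so the master simply re-learns it at startup and via heartbeats.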
Consistency Model
- Consistent = all clients see same data
- Defined = consistent + clients see full effect of mutation
- Key: all replicas must process chunk-mutation requests in same order
- Different clients may see different data
A write causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing.
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.
Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written.
Implications
- Apps must rely on appends, not overwrites
- Must write records that
-Self-validate
-Self-identify
- Typical uses
-Single writer writes file from beginning to end, then renames file (or checkpoints along way)
-Many writers concurrently append
- At-least-once semantics ok
- Readers deal with padding & duplicates
Applications accommodate the relaxed consistency model by relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
Checkpoints may also include application-level checksums.
Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state.
Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums.
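The prescription above can be made concrete. Below is a sketch of one possible record framing and a reader that discards padding and duplicates via checksums and record ids. The framing (4-byte length, 4-byte CRC32, 8-byte record id, payload) is an assumption; GFS leaves the format to applications:

```python
import struct
import zlib

def pack_record(record_id: int, payload: bytes) -> bytes:
    """Self-validating (CRC32) and self-identifying (record id) framing."""
    body = struct.pack(">Q", record_id) + payload
    return struct.pack(">II", len(body), zlib.crc32(body)) + body

def read_records(data: bytes):
    """Yield (id, payload) for valid records, skipping padding and duplicates."""
    seen, pos = set(), 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from(">II", data, pos)
        body = data[pos + 8 : pos + 8 + length]
        if length >= 8 and len(body) == length and zlib.crc32(body) == crc:
            rid = struct.unpack_from(">Q", body)[0]
            if rid not in seen:        # self-identifying: drop duplicate appends
                seen.add(rid)
                yield rid, body[8:]
            pos += 8 + length
        else:
            pos += 1                   # padding or fragment: resync byte by byte
```

A real reader would resync on a magic marker instead of scanning byte by byte, but the failure-handling logic is the same: anything that does not checksum is padding or a fragment, and repeated ids are retried appends.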
Leases & Mutation Order
- Objective
-Ensure data consistent & defined
-Minimize load on master
- Master grants ‘lease’ to one replica
-Called ‘primary’ chunkserver
- Primary serializes all mutation requests
-Communicates order to replicas
A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append operation.
Each mutation is performed at all the chunk’s replicas.
The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations
on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
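The master's lease bookkeeping under these rules can be sketched as follows. The 60-second lease term is the paper's default; the class and method names are illustrative:

```python
LEASE_TERM = 60.0  # seconds; the initial lease timeout from the paper

class LeaseTable:
    """Tracks, per chunk handle, which replica is primary and until when."""

    def __init__(self):
        self.leases = {}  # handle -> (primary chunkserver, expiry time)

    def grant(self, handle, replicas, now):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if now < expiry:
            return primary                 # an unexpired lease stays valid
        primary = replicas[0]              # choose an up-to-date replica
        self.leases[handle] = (primary, now + LEASE_TERM)
        return primary

lt = LeaseTable()
assert lt.grant("c1", ["csA", "csB"], now=0.0) == "csA"
assert lt.grant("c1", ["csB", "csA"], now=30.0) == "csA"   # lease still held
assert lt.grant("c1", ["csB", "csA"], now=100.0) == "csB"  # old lease expired
```

This is why losing contact with a primary is safe: the master only hands out a new lease after the old expiry time has passed, so two primaries can never be valid at once.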
Write Control & Dataflow
1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).
2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver stores the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology, regardless of which chunkserver is the primary.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.
5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.
6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
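The primary's serial-number assignment is the heart of the consistency argument: the primary picks one order, and every replica applies mutations in that same order. A toy sketch (replicas modeled as plain lists; names illustrative):

```python
class Primary:
    """Assigns consecutive serial numbers and forwards them to secondaries."""

    def __init__(self, secondaries):
        self.serial = 0
        self.applied = []              # mutations in assigned order
        self.secondaries = secondaries

    def write(self, mutation):
        self.serial += 1
        self.applied.append((self.serial, mutation))   # apply locally in order
        for s in self.secondaries:                     # forward with serial number
            s.append((self.serial, mutation))
        return self.serial

sec1, sec2 = [], []
p = Primary([sec1, sec2])
p.write("mutA")   # possibly from client 1
p.write("mutB")   # possibly from client 2
assert p.applied == sec1 == sec2 == [(1, "mutA"), (2, "mutB")]
```

Because every replica replays the same (serial, mutation) sequence, all replicas converge to identical chunk state when no step fails, which is exactly the "consistent" guarantee above.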
Atomic Appends
- As in last slide, but…
- Primary also checks to see if append spills over into new chunk
-If so, pads old chunk to full extent
-Tells secondary chunk-servers to do the same
-Tells client to try append again on next chunk
- Usually works because
-max(append-size) < ¼ chunk-size [API rule]
-(meanwhile other clients may be appending)
If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data, possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit.
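The boundary check the primary performs can be sketched as follows (the return convention is illustrative; the quarter-chunk limit is the API rule above, which bounds worst-case padding waste):

```python
CHUNK_SIZE = 64 * 1024 * 1024

def try_append(chunk_used: int, record_len: int):
    """Primary's decision for a record append at the end of a chunk."""
    assert record_len <= CHUNK_SIZE // 4   # API rule: record <= 1/4 chunk size
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of this chunk (and tell secondaries to do the same),
        # then make the client retry the append on the next chunk.
        return "pad_and_retry", CHUNK_SIZE - chunk_used
    return "appended_at", chunk_used       # offset chosen by GFS, not the client

assert try_append(0, 1024) == ("appended_at", 0)
assert try_append(CHUNK_SIZE - 100, 1024) == ("pad_and_retry", 100)
```

Since a record can occupy at most a quarter of a chunk, the padded tail of any chunk is always smaller than a quarter chunk, keeping internal fragmentation acceptable.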
Other Issues
- Fast snapshot
- Master operation
-Namespace management & locking
-Replica placement & rebalancing
-Garbage collection (deleted / stale files)
-Detecting stale replicas
The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations.
copy-on-write:
When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.
After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.
The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. The master notices that the reference count for chunk C is greater than one. It defers replying to the client request and instead picks a new chunk handle C’. It then asks each chunkserver that has a current replica of C to create a new chunk called C’. By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network.
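The copy-on-write path can be sketched with reference counts (data structures and names are illustrative; files are single-chunk for brevity):

```python
refcount = {"C": 1}        # chunk handle -> number of files referencing it
chunk_of = {"/src": "C"}   # file -> chunk handle
store = {"C": b"old"}      # chunkserver contents

def snapshot(src: str, dst: str):
    """Metadata-only copy: the snapshot shares the source's chunks."""
    chunk_of[dst] = chunk_of[src]
    refcount[chunk_of[src]] += 1

def write(path: str, data: bytes):
    handle = chunk_of[path]
    if refcount[handle] > 1:            # shared with a snapshot: clone first
        refcount[handle] -= 1
        clone = handle + "'"            # new handle C'
        store[clone] = store[handle]    # copied locally on the same chunkservers
        refcount[clone] = 1
        chunk_of[path] = handle = clone
    store[handle] = data

snapshot("/src", "/snap")
write("/src", b"new")
assert store["C"] == b"old" and store["C'"] == b"new"   # snapshot untouched
```

The snapshot itself touches only metadata, which is why it is nearly instantaneous; the expensive chunk copy is deferred until a write actually arrives, and even then it stays local to each chunkserver.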
Master Operation – Namespace Management & Locking
Locks are used over namespaces to ensure proper serialization
Read/write locks
GFS simply uses directory-like file names: /foo/bar
GFS logically represents its namespace as a lookup table mapping full pathnames to metadata
If a master operation involves /d1/d2/.../dn/leaf, read locks are acquired on /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read or write lock on the full pathname /d1/d2/.../dn/leaf
The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage.
We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no “directory”, or inode-like, data structure to be protected from modification. The read lock on the name is sufficient to protect the parent directory from deletion.
One nice property of this locking scheme is that it allows concurrent mutations in the same directory.
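The lock-set rule, and the snapshot-versus-create conflict described above, can be sketched as:

```python
def lock_set(path: str, write: bool):
    """Read locks on every proper prefix; a read or write lock on the full name."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "r") for p in prefixes] + [(path, "w" if write else "r")]

# Snapshotting /home/user to /save/user vs. creating /home/user/foo:
snap = lock_set("/home/user", write=True) + lock_set("/save/user", write=True)
create = lock_set("/home/user/foo", write=True)

# Both operations touch /home/user, one with "w" and one with "r", so they
# conflict and are serialized; all other locks are compatible.
assert ("/home/user", "w") in snap and ("/home/user", "r") in create
```

Because two creates in the same directory both take only read locks on the parent, they do not conflict with each other, which is the concurrent-mutation property noted above.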
Master Operation
- Replica Placement
-Maximize data reliability and availability
-Maximize network bandwidth utilization
- Re-replication
-The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal
- Rebalancing
-The master rebalances replicas periodically: it examines the replica distribution and moves replicas for better disk-space usage and load balancing
We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline.
(1) We want to place new replicas on chunkservers with below-average disk-space utilization. (2) We want to limit the number of “recent” creations on each chunkserver. (3) As discussed above, we want to spread replicas of a chunk across racks.
Each chunk that needs to be re-replicated is prioritized based on several factors. One is how far it is from its replication goal. In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files. Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.
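The three factors compose naturally as a sort key. The tuple ordering below is an assumption (the paper lists the factors without exact weights); Python compares tuples lexicographically, so the most urgent chunk sorts first:

```python
def priority(c):
    # Blocking clients dominates, then live vs. deleted files, then
    # distance from the replication goal (missing replica count).
    return (c["blocking_client"], c["live_file"], c["goal"] - c["replicas"])

chunks = [
    {"id": "a", "goal": 3, "replicas": 2, "live_file": True,  "blocking_client": False},
    {"id": "b", "goal": 3, "replicas": 1, "live_file": True,  "blocking_client": False},
    {"id": "c", "goal": 3, "replicas": 2, "live_file": False, "blocking_client": False},
]
order = sorted(chunks, key=priority, reverse=True)
assert [c["id"] for c in order] == ["b", "a", "c"]
```

Chunk "b" is two replicas short and wins; "a" beats "c" because it belongs to a live file, even though both are one replica short.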
Master Operation
- Garbage collection
-Lazy deletion of files
-The master deletes a hidden file during its regular scan if the file has existed for three days
-HeartBeat messages inform the chunkservers about the deleted files' chunks
- Stale Replica Detection
-The Master maintains a chunk version number
-The Master removes stale replicas in its regular garbage collection
When a file is deleted by the application, the master logs the deletion immediately, just like other changes. However, instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. The master and these replicas all record the new version number in their persistent state. This occurs before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced.
As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk.
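Version bookkeeping in miniature (names are illustrative; replicas are modeled as dicts holding their persisted version numbers):

```python
master_version = {"c1": 5}  # master's persisted chunk version numbers

def grant_lease(handle, up_to_date_replicas):
    """Bump the version and tell the replicas that are currently reachable."""
    master_version[handle] += 1
    for replica in up_to_date_replicas:
        replica[handle] = master_version[handle]  # persisted before writes start

def is_stale(handle, replica):
    """A replica that missed a version bump lags the master and is stale."""
    return replica.get(handle, -1) < master_version[handle]

r1, r2, r3 = {"c1": 5}, {"c1": 5}, {"c1": 5}
grant_lease("c1", [r1, r2])          # r3 is down and misses the bump
assert not is_stale("c1", r1)
assert is_stale("c1", r3)            # detected and garbage-collected later
```

Because the version is bumped before any client can write, a replica that was down during a mutation is guaranteed to report an old number when it rejoins.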
High Availability
- Fast Recovery
- Chunk Replication
- Master Replication

The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines.
Moreover, “shadow” masters provide read-only access to the file system even when the primary master is down.
What If the Master Reboots?
- Replays log from disk
-Recovers namespace (directory) information
-Recovers file-to-chunk-ID mapping
- Asks chunkservers which chunks they hold
-Recovers chunk-ID-to-chunkserver mapping
- If chunk server has older chunk, it’s stale
-Chunk server down at lease renewal
- If chunk server has newer chunk, adopt its version number
-Master may have failed while granting lease
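The chunk-location map is the one piece of state not in the log; the rebooted master rebuilds it by polling the chunkservers. A sketch (names illustrative):

```python
def rebuild_locations(chunkserver_reports):
    """Invert 'which chunks does each chunkserver hold' into handle -> servers."""
    locations = {}
    for cs_id, handles in chunkserver_reports.items():
        for handle in handles:
            locations.setdefault(handle, []).append(cs_id)
    return locations

# After replaying the log, the master asks each chunkserver what it holds:
locs = rebuild_locations({"cs1": ["c1", "c2"], "cs2": ["c1"]})
assert sorted(locs["c1"]) == ["cs1", "cs2"]
assert locs["c2"] == ["cs1"]   # only one replica: a candidate for re-replication
```

The reports would also carry each chunk's version number, letting the master apply the stale/newer checks above during the same scan.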
What if Chunkserver Fails?
- Master notices missing heartbeats
- Master decrements count of replicas for all chunks on dead chunkserver
- Master re-replicates chunks missing replicas in background
-Highest priority for chunks missing greatest number of replicas
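Failure detection itself is just a heartbeat timeout. A sketch (the 30-second timeout is an assumption for illustration, not a value from the paper):

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative, not specified by the paper

def dead_chunkservers(last_heartbeat, now):
    """Chunkservers whose last heartbeat is older than the timeout."""
    return [cs for cs, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

# cs2 last reported 60 s ago and is declared dead; every chunk it held
# loses one replica and is queued for background re-replication.
assert dead_chunkservers({"cs1": 100.0, "cs2": 50.0}, now=110.0) == ["cs2"]
```

Once a server is declared dead, the master decrements the replica counts of its chunks and feeds them into the priority-ordered re-replication queue described earlier.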
