A journaling file system is a file system that maintains a tracking file called a journal. The journal lets the system repair inconsistencies that arise when the system halts abnormally. It does this by recording changes in the journal before committing them to the main file system. If the computer is not shut down properly, the changes recorded in the journal can be replayed to bring the file system back to a consistent state. This type of file system is therefore less likely to suffer from corruption, and it can be brought back online quickly after a crash.
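To make the "journal first, main file system second" idea concrete, here is a minimal in-memory sketch. It is a toy model, not how any real file system lays out its journal: the class `JournaledStore` and its methods are hypothetical names invented for this illustration, and a dictionary stands in for the disk.

```python
class JournaledStore:
    """Toy key-value 'file system' with a write-ahead journal (illustrative only)."""

    def __init__(self):
        self.main = {}      # stands in for the main file system
        self.journal = []   # stands in for the on-disk journal

    def begin(self, changes):
        # Step 1: record the intended changes in the journal first.
        self.journal.append({"changes": dict(changes), "committed": False})
        return len(self.journal) - 1

    def commit(self, txid):
        # Step 2: mark the journal entry as complete (the commit record).
        self.journal[txid]["committed"] = True

    def apply(self, txid):
        # Step 3: copy the committed changes into the main store.
        if not self.journal[txid]["committed"]:
            raise RuntimeError("cannot apply an uncommitted transaction")
        self.main.update(self.journal[txid]["changes"])

    def recover(self):
        # After a crash: replay committed transactions in order and
        # discard anything that never reached its commit record.
        for entry in self.journal:
            if entry["committed"]:
                self.main.update(entry["changes"])
        self.journal.clear()
```

If a crash happens after `commit` but before `apply`, calling `recover` replays the change; if it happens before `commit`, the half-recorded change is simply dropped, so the main store never sees a partial update.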
The easiest decision is no decision. Let’s have two user interfaces, two modes: the easy mode for my mother-in-law, and the pro mode for engineers, McKinsey consultants, and investment bankers. Such dual-mode systems haven’t been very popular so far; they have been tried without success on PCs and Macs. (Re-reading this, I realise the Mac itself could be considered such a dual-mode machine: fire up the Terminal app, and you have access to a certified Unix engine living inside.)
Linus Torvalds and others in the past have characterized FUSE file-systems as being for toys and misguided people, but FUSE has been used before for bringing Sun/Oracle’s ZFS to Linux, various other creative file-system implementations, and now exFAT. ExFAT support for Linux has been talked about going back to early 2009 but the support has been crap on Linux.
I always find filesystem debates fascinating.
In this short post, I’d like to show how hash-DoS can be applied to the btrfs file-system with some astonishing and unexpected success. Btrfs, while still in the development stage, is widely considered a viable successor of ext4, and an implementation of it is already part of the Linux kernel. According to this page,
As a warning for those who are normally quick to upgrade to the latest stable vanilla kernel releases, a serious EXT4 data corruption bug worked its way into the stable Linux 3.4, 3.5, and 3.6 kernel series.
The reason why the problem happens rarely is that the effect of the buggy commit is that if the journal’s starting block is zero, we fail to truncate the journal when we unmount the file system. This can happen if we mount and then unmount the file system fairly quickly, before the log has a chance to wrap. After the first time this has happened, it’s not a disaster, since when we replay the journal, we’ll just replay some extra transactions. But if this happens twice, the oldest valid transaction will still not have gotten updated, but some of the newer transactions from the last mount session will have gotten overwritten by the very latest transactions, and when we then try to do the extra transaction replays, the metadata blocks can end up getting very scrambled indeed.
But, dig beneath the hood of this story—and the diagram included—and you’ll see another story. One that points to the key role of open source software in making this phenomenal mission work and the results available to so many, so quickly.
Perhaps the most important piece of this high-demand configuration, GlusterFS is an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients. GlusterFS clusters together storage building blocks over an InfiniBand RDMA or TCP/IP interconnect, aggregating disk and memory resources and managing data in a single global namespace.
F2FS is a new file system carefully designed for NAND flash memory-based storage devices. We chose a log-structured file system approach, but we tried to adapt it to the new form of storage. We also remedy some known issues of the very old log-structured file system, such as the snowball effect of the wandering tree and high cleaning overhead.
AFS is a distributed filesystem product, pioneered at Carnegie Mellon University and supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for federated file sharing and replicated read-only content distribution, providing location independence, scalability, security, and transparent migration capabilities. AFS is available for a broad range of heterogeneous systems including UNIX, Linux, Mac OS X, and Microsoft Windows.
IBM branched the source of the AFS product, and made a copy of the source available for community development and maintenance. They called the release OpenAFS.
Most applications do not deal with disks directly, instead storing their data in files in a file system, which protects us from those scoundrel disks. After all, a key task of the file system is to ensure that the file system can always be recovered to a consistent state after an unplanned system crash (for example, a power failure). While a good file system will be able to beat the disks into submission, the required effort can be great and the reduced performance annoying. This article examines the shortcuts that disks take and the hoops that file systems must jump through to get the desired reliability.
Luckily, SATA (serial ATA) has a new definition called NCQ (Native Command Queueing) that has a bit in the write command that tells the drive if it should report completion when media has been written or when cache has been hit. If the driver correctly sets this bit, then the disk will display the correct behavior.
In the real world, many of the drives targeted to the desktop market do not implement the NCQ specification. To ensure reliability, the system must either disable the write cache on the disk or issue a cache-flush request after every metadata update, log update (for journaling file systems), or fsync system call. Both of these techniques lead to noticeable performance degradation, so they are often disabled, putting file systems at risk if the power fails. Systems for which both speed and reliability are important should not use ATA disks. Rather, they should use drives that implement Fibre Channel, SCSI, or SATA with support for NCQ.
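From the application side, the fsync system call mentioned above is the usual way to ask for data to reach stable storage. The following is a minimal sketch of a durable write; `durable_write` is a hypothetical helper name, and the guarantee is only as good as the drive's handling of the flush request, which is exactly the caveat this passage raises.

```python
import os


def durable_write(path, data):
    """Write bytes to path and ask the OS to flush them to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Request that the file's data and metadata be pushed to the device.
        # If the drive acknowledges the flush from its write cache without
        # writing to media, this call cannot detect that.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Calling `durable_write("journal.log", b"commit record")` returns only after the fsync call has completed. Whether the bytes are then truly on the platters depends on the disk behaving as described above.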
On Linux, here’s how you can check whether your drive has NCQ:
$ cat /sys/block/sd?/device/queue_depth
A value of 1 indicates no NCQ. You can also check the queue type:
$ cat /sys/block/sd?/device/queue_type
My green drives came back with none.
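The same check can be scripted. This is a rough sketch that reads the two sysfs files used above for every `sd?` device; `ncq_status` is a hypothetical helper name, and the `sys_block` parameter exists only so the function can be pointed at a test directory. Treating a queue depth greater than 1 as "NCQ likely" is a heuristic based on the rule of thumb above, not a definitive test, since non-SATA devices can also report deeper queues.

```python
from pathlib import Path


def ncq_status(sys_block="/sys/block"):
    """Return queue depth and queue type for each sd? device under sys_block."""
    results = {}
    for dev in sorted(Path(sys_block).glob("sd?")):
        device_dir = dev / "device"
        try:
            depth = int((device_dir / "queue_depth").read_text().strip())
        except (OSError, ValueError):
            continue  # file missing or unreadable: skip this device
        try:
            queue_type = (device_dir / "queue_type").read_text().strip()
        except OSError:
            queue_type = None  # not every device exposes this file
        results[dev.name] = {
            "queue_depth": depth,
            "queue_type": queue_type,
            # Heuristic: a depth of 1 indicates no command queueing.
            "ncq_likely": depth > 1,
        }
    return results
```

Running `ncq_status()` on a Linux machine returns a dictionary keyed by device name, for example `sda`, which can be printed or logged as needed.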
GFS saves its file system descriptors in inodes that are allocated dynamically (referred to as dynamic nodes or dinodes). Each dinode occupies a whole file system block (4096 bytes is the standard file system block size in Linux kernels). In a cluster file system, multiple servers access the file system at the same time; hence, pooling multiple dinodes in one block would lead to more competing block accesses and false contention. For space efficiency and reduced disk accesses, file data is saved (stuffed) in the dinode itself if the file is small enough to fit completely inside the dinode. In this case, only one block access is necessary to access smaller files. If the files are bigger, GFS uses a “flat file” structure: all pointers in a dinode have the same depth, so they are all direct, all indirect, or all double indirect pointers. The tree height grows as much as necessary to store the file data, as shown in Figure 1.
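The stuffing rule and the flat pointer tree can be illustrated with a little arithmetic. The sketch below estimates the tree height needed for a given file size. The header and pointer sizes (`dinode_header`, `indirect_header`, `pointer_size`) are placeholder values assumed for illustration, not figures taken from the GFS on-disk format, and `tree_height` is a hypothetical function name.

```python
def tree_height(file_size, block_size=4096, dinode_header=232,
                indirect_header=24, pointer_size=8):
    """Estimate the pointer-tree height for a file (0 means stuffed in the dinode).

    All size defaults except block_size are assumptions for illustration only.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")

    # Small files are stuffed into the space left after the dinode header.
    stuffed_capacity = block_size - dinode_header
    if file_size <= stuffed_capacity:
        return 0

    # Otherwise the same space holds pointers, all of the same depth.
    dinode_ptrs = stuffed_capacity // pointer_size
    indirect_ptrs = (block_size - indirect_header) // pointer_size

    # Height 1: every dinode pointer addresses a data block directly.
    height = 1
    capacity = dinode_ptrs * block_size
    # Each extra level multiplies capacity by the pointers per indirect block.
    while capacity < file_size:
        capacity *= indirect_ptrs
        height += 1
    return height
```

Under these assumed sizes, a 100-byte file is stuffed (height 0), a 1 MiB file needs only direct pointers (height 1), and a 100 MiB file needs one level of indirection (height 2). The exact thresholds shift with the real header sizes, but the growth pattern is the point.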