In my last post I made some comments about Microsoft Windows not being capable of enterprise high performance computing. In the comments (upon request) I posted some details on the operating system's SCSI subsystem, describing how scatter-gather lists are handled when sequential SCSI commands are coalesced just before being sent to the SCSI-based media. I want to continue on that topic and focus specifically on the NTFS file system and why it, too, is not intended for enterprise-class usage.
First I wish to start with the NTFS layout on the volume. Note that I will use the term volume to mean either a pool of disk devices (traditionally pooled together by an external controller and mapped as a Logical Unit) or a single disk device; in both cases, the host sees a single volume. I may be a little old-fashioned in following the footsteps of our UNIX forefathers, but I have always felt that if you want an increase in disk access performance, you keep your data and metadata close to each other. Most traditional POSIX-compliant file systems used today for server/client workloads divide a volume into allocation groups (or block groups), and within each AG the metadata blocks sit alongside the data blocks for the files they describe. Microsoft, on the other hand, continued down the path of the original FAT file systems and placed all metadata at the VERY beginning of the volume, while file data is scattered throughout the rest of it. Placing metadata regions close to data regions decreases seek latency. For example, say you are writing to a 500 GB or even a 1 TB volume (especially now that a single SATA disk goes even higher in capacity). Ignoring journaling for a moment, an NTFS file system requires constant seeks back to the MFT (Master File Table) located at the very beginning of the file system, no matter how far into the volume the data gets written. This layout is obviously not intended for high performance environments. When we look at file systems such as Ext2/3-fs, XFS and so on, we see that the volume is divided into multiple equal-sized groups, where each group contains all the metadata associated with its file data.
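To make the seek argument concrete, here is a toy model (my own illustration, not real file system code) of how far the disk head must travel between a file's data block and its metadata under the two layouts. The volume size, group count, and sample block addresses are arbitrary assumptions:

```python
# Toy model: distance (in blocks) between a file's data and its metadata
# under an MFT-at-the-start layout versus an allocation-group layout.
# All sizes here are made-up round numbers for illustration only.

VOLUME_BLOCKS = 250_000_000          # roughly a 1 TB volume with 4 KB blocks
GROUPS = 1_000                       # allocation groups in the AG-style layout
GROUP_SIZE = VOLUME_BLOCKS // GROUPS

def seek_mft_layout(data_block):
    """All metadata lives at block 0 (MFT at the front of the volume)."""
    return data_block                # head travels all the way back to the front

def seek_ag_layout(data_block):
    """Metadata lives at the start of the group that holds the data."""
    return data_block % GROUP_SIZE   # head only crosses part of one group

for block in (1_000, 125_100_000, 249_000_000):
    print(block, seek_mft_layout(block), seek_ag_layout(block))
```

The point of the sketch: in the MFT-style layout the metadata seek grows with the data's offset into the volume, while in the allocation-group layout it is bounded by the group size no matter where the data lands.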
Moving on to the journal: what makes NTFS truly stand out from its FAT predecessors is the ability to journal changes made to the file system for quick recovery in the event of a failure. With that journal comes a performance loss. Most traditional UNIX/Linux file systems use a journal, but lately there has been a movement toward log-structured concepts. A journaling file system must first record changes in the journal before they are committed to disk. If the system fails with operations incomplete, the journal is then replayed to resolve any problems the failure may have caused. Ideally the journal is implemented for speedy recoveries. There are two popular journaling methods: (1) metadata only, and (2) metadata plus file data. Ext3-fs supports the latter through its data=journal mount option (its default, data=ordered, journals metadata only but writes file data out before committing the metadata). Unfortunately this tunability is not well known. While full data journaling offers the most redundant solution in times of failure, it takes an even greater performance cost, writing all metadata and file data to the volume twice: once to the journal and again to the file system. NTFS, on the other hand (as with XFS, ReiserFS, etc.), does metadata-only journaling. As I mentioned earlier in this paragraph, this too takes a performance hit, and that is why many of the more recently developed file systems (i.e. ZFS, btrfs, Reiser4, etc.) have adopted a copy-on-write method of logging, where all new data gets written to a new location on the volume (ideally in sequence with the last written location), and upon success the metadata is updated to point to the new region(s) for the file. In this scenario, nothing gets written twice, and that is one of many reasons why ZFS, and hopefully what will soon be a stable btrfs, are classified as enterprise-class file systems.
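As a back-of-the-envelope illustration (my own numbers, assuming a 1 MB file write and a single 4 KB block of metadata updates), here is roughly how many bytes each strategy physically writes:

```python
# Rough write-amplification comparison for one 1 MB file write.
# These are illustrative assumptions, not measured file system behavior.

FILE_DATA = 1_048_576   # 1 MB of file data
METADATA = 4_096        # assume one 4 KB block of metadata updates

# Full data journaling (e.g. Ext3's data=journal mode): data and metadata
# both go to the journal first, then again to their home locations.
full_journal = 2 * (FILE_DATA + METADATA)

# Metadata-only journaling (NTFS, XFS, ...): only metadata is written twice.
meta_journal = FILE_DATA + 2 * METADATA

# Copy-on-write / log-structured (ZFS, btrfs): everything is written once,
# to a new location, and then the metadata pointers are flipped.
cow = FILE_DATA + METADATA

print(full_journal, meta_journal, cow)
```

The ordering is what matters here: full data journaling roughly doubles the bytes written, metadata-only journaling adds a small fixed overhead, and copy-on-write writes each byte once.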
This log structure also allows for easy implementation of snapshot features, which NTFS provides through Volume Shadow Copy, also called the Volume Snapshot Service (VSS); unfortunately I do not know enough about it to comment in depth. All I do know is that VSS is not internal to the file system and must run on top of it as an additional service.
Another major drawback is that some key NTFS file system maintenance MUST be done offline! Volume resizing (growing or shrinking) can be a frequent task at the enterprise level. Taking the volume (and node) offline to accomplish it is not good at all! That costs time, which in turn costs money.
The allocation size (also known as block size, or cluster size in the Microsoft world) is the file system's minimum unit of space to which data gets written. To clarify: say you are using a block size of 4 KB and you write 2 MB of data; you will use 512 4 KB blocks (2097152 / 4096) to store it. On the other hand, if you write 579 bytes of data, that will use one 4 KB block, and (unless the file system supports tail packing, which NTFS does not) the rest of that block is wasted until the file grows or is deleted and the region is eventually overwritten with something larger. NTFS defaults to cluster sizes between 512 bytes and 4 KB depending on volume size (larger clusters are possible, but at the cost of features such as compression). For high performance computing, and especially when working with larger files, it is sometimes necessary to go larger still. XFS starts at 512 bytes and can go as high as 64 KB; ZFS can go up to 128 KB. These larger sizes do increase performance when working with large files, and they also give the volume a higher capacity range, as less metadata is used.
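The arithmetic above can be sketched in a few lines; `blocks_used` and `slack` are hypothetical helper names of my own, not part of any file system API:

```python
import math

def blocks_used(file_bytes, block_size):
    """Whole blocks a file occupies when there is no tail packing."""
    return math.ceil(file_bytes / block_size)

def slack(file_bytes, block_size):
    """Bytes wasted in the final, partially filled block."""
    return blocks_used(file_bytes, block_size) * block_size - file_bytes

print(blocks_used(2 * 1024 * 1024, 4096))  # 2 MB file in 4 KB blocks -> 512
print(slack(579, 4096))                    # 579-byte file wastes 3517 bytes
```

Run the same numbers with a 64 KB block size and the trade-off is visible: large files need far fewer blocks (and far less metadata), while small files waste proportionally more slack space.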
If I really had the time, this list could go on, but I wanted to shed some light on one last point: volume mount points. In a Microsoft environment you are limited to the 26 letters of the English alphabet for drive letters. This is not the case on POSIX-like platforms. Again, in an enterprise environment there may be a need to handle many volumes, and things get problematic if a Microsoft server attempts to serve more than 26. In fact, since A, B, and C are by default taken by the floppy drives and the root operating system volume, you have only 23 letters left for CD/DVD, tape, and other disk media devices.
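A trivial sketch of that drive letter arithmetic (my own illustration, assuming the classic layout where A, B, and C are already claimed):

```python
import string

# A and B are traditionally reserved for floppy drives, C for the boot
# volume; this reservation is an assumption of the classic Windows layout.
reserved = {"A", "B", "C"}
available = [letter for letter in string.ascii_uppercase
             if letter not in reserved]
print(len(available))  # -> 23
```

Twenty-three slots for every CD/DVD, tape, and disk device is a hard ceiling; on a POSIX-like platform each volume simply mounts at a directory, so the limit does not exist.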
Don’t get me wrong: like any other file system, NTFS can be tuned for better results. You can get better performance out of any file system if you seek out the proper documentation, and that documentation is all over the internet.
With these limitations well known, why do we still try to deploy Microsoft Windows in environments it is not suited for? The answer is familiarity. Microsoft for the most part owns the client/end-user market, and with that the end-user has become too familiar and too comfortable with its platform. In turn, what was built for home (and, to an extent, small business) use has leaked into an environment it is not ready for. Please understand that I am not trying to preach against Microsoft or attack them. Like many others in the high-performance server/storage industry, I have come to understand where certain problems originate, and that includes the limitations of the Windows platform. If you, the reader, see Microsoft's role in enterprise-class computing differently, please feel free to comment. I know I may not always be correct in my viewpoints, and if you can shed any additional light I would be very grateful.