Archive for December, 2008

Why Microsoft is just not ready for the enterprise.

December 31st, 2008 8 comments

In my last post I made some comments about Microsoft Windows not being capable of enterprise high performance computing. In the comments (upon request) I posted some details on the SCSI subsystem of the operating system, describing how scatter-gather lists are built when sequential SCSI commands are coalesced just before being sent to the SCSI-based media. I want to continue on that topic and focus specifically on the NTFS file system and why it, too, is not intended for enterprise class usage.

First I wish to start with the NTFS layout on the volume. Note that I will use the term volume to signify either a pool of disk devices (traditionally pooled together via an external controller and mapped as a Logical Unit) or a single disk device; in both cases they are represented to the host as a single volume. I may be a little old-fashioned in following the footsteps of our UNIX forefathers, but I have always felt that if you want an increase in disk access performance, you keep your data and metadata close to each other. Most traditional POSIX-compliant file systems used today for server/client workloads divide a volume into Allocation Groups (or Block Groups), and within each AG the metadata blocks sit adjacent to the data blocks they describe. Microsoft, on the other hand, continued down the path of its original FAT file systems and placed all metadata at the VERY beginning of the volume, while the (often fragmented) file data is scattered throughout the rest of it. Placing metadata regions close to their data regions reduces seek latency. For example, say you are writing to a 500 GB or even a 1 TB volume (especially now that a single SATA disk device goes even higher in capacity). Ignoring journaling for a moment, an NTFS file system requires constant seeks back to the MFT (Master File Table) located at the very beginning of the file system, no matter how far into the volume the data gets written. This layout is obviously not intended for high performance environments. When we look at file systems such as Ext2/3-fs, XFS and so on, we see that the volume is divided into multiple equal-sized groups, where each group contains all the metadata associated with its file data.
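To make the seek argument concrete, here is a toy back-of-the-envelope model (all numbers are made up for illustration; this is a sketch of the geometry, not a measurement of either file system):

```python
# Toy seek-distance model: hypothetical sizes, purely illustrative.
VOLUME_MB = 1_000_000            # a ~1 TB volume
GROUPS = 8_000                   # hypothetical allocation-group count
GROUP_MB = VOLUME_MB // GROUPS   # 125 MB per group

def seek_central(data_offset_mb):
    """Metadata pinned at the start of the volume (MFT-style):
    every metadata update seeks all the way back toward offset 0."""
    return data_offset_mb

def seek_grouped(data_offset_mb):
    """Metadata kept at the head of each allocation group:
    the head only travels within the current group."""
    return data_offset_mb % GROUP_MB

# Writing near the end of the volume:
print(seek_central(900_057))   # 900057 MB of head travel
print(seek_grouped(900_057))   # 57 MB of head travel
```

Even in this crude model, the per-group layout keeps the head travel bounded by the group size instead of growing with the write offset.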

Moving on to the journal: what makes NTFS truly stand out from its FAT predecessors is its ability to journal changes made to the file system for quick recovery in the event of a failure. With that journal comes a performance loss. Most traditional UNIX/Linux file systems use a journal, though lately there has been a movement toward log-structured and copy-on-write designs. A journaling file system must first copy changes to the journal before the data is committed to its final location on disk. If the system fails with operations incomplete, the journal is played back to resolve any inconsistencies the failure may have caused. Ideally, the journal exists for speedy recoveries. There are two popular journaling methods: (1) metadata only, and (2) metadata plus file data. Ext3-fs supports both: its data=journal mode journals file data as well, while its default data=ordered mode journals only metadata (flushing data blocks before the related metadata commits). Unfortunately this is not well known. While Ext3-fs in full data journaling mode offers the most redundant solution in times of failure, it pays an even larger performance cost by writing all metadata and file data to the volume twice: once to the journal and again to the file system. NTFS, on the other hand (like XFS, ReiserFS, etc.), does metadata-only journaling. As mentioned earlier in this paragraph, this too takes a performance hit, which is why many of the more recently developed file systems (e.g. ZFS, btrfs, Reiser4) have adopted a different method of logging, where all new data is written to a new location on the volume (ideally in sequence with the last written location) and, upon success, the metadata is updated to point to the new region(s) of the file. In this scenario nothing is written twice, and that is one of many reasons why ZFS, and hopefully what will soon be a stable btrfs, are classified as enterprise class file systems.
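The write-amplification argument can be sketched as a rough per-update cost model (a simplification of my own that ignores journal commit records, batching and write barriers; the mode names are just labels for the three schemes discussed above):

```python
def bytes_written(data_bytes, meta_bytes, mode):
    """Rough bytes-hitting-disk model for one file update (a sketch,
    not a benchmark of any real file system)."""
    if mode == "data+meta":  # full journaling, e.g. ext3 data=journal
        return 2 * (data_bytes + meta_bytes)  # journal copy + final copy
    if mode == "meta":       # metadata-only journaling, e.g. NTFS, XFS
        return data_bytes + 2 * meta_bytes    # only metadata written twice
    if mode == "cow":        # copy-on-write/log-structured, e.g. ZFS, btrfs
        return data_bytes + meta_bytes        # new location + pointer update
    raise ValueError(mode)

# One 4 KB data write carrying 512 bytes of metadata:
print(bytes_written(4096, 512, "data+meta"))  # 9216
print(bytes_written(4096, 512, "meta"))       # 5120
print(bytes_written(4096, 512, "cow"))        # 4608
```

The ordering of the three results is the whole point: full journaling doubles everything, metadata-only journaling doubles only the small part, and copy-on-write writes nothing twice.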
This log structure also allows for easy implementation of snapshot features, which on Windows are provided by the Volume Shadow Copy Service (VSS, also known as the Volume Snapshot Service); unfortunately I do not know enough about it to comment in depth. All I do know is that VSS is not internal to the file system and must run on top of it as an additional service.

Another major drawback is that some key NTFS file system maintenance MUST be done offline. Volume resizing (growing or shrinking) can be a frequent task at the enterprise level, and taking the volume (and node) offline to accomplish it is not good at all. This costs time, which in turn costs money.

The allocation size (also known as the block size, or in the Microsoft world as the cluster size) is the file system’s minimum unit of allocation to which data is written. To clarify: if you are using a 4 KB block size and you write 2 MB of data (2097152 / 4096), you will use 512 of those 4 KB blocks to store it. On the other hand, if you write 579 bytes of data, that will occupy one 4 KB block, and (unless the file system supports tail packing, which NTFS does not) the rest of that block is wasted until the file grows or is deleted and the region is overwritten by something larger. NTFS typically uses cluster sizes of 512 bytes, 1 KB, 2 KB or 4 KB (larger clusters are possible, but at the cost of features such as compression). For high performance computing, and especially when working with larger files, it is sometimes necessary to go bigger. XFS starts at 512 bytes and can go as high as 64 KB; ZFS can go up to 128 KB. These larger sizes do increase performance when working with large files, and they also give the volume a higher capacity range, as less metadata is used.
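The cluster arithmetic above is easy to check with a couple of helper functions (a minimal sketch assuming a fixed 4 KB cluster and no tail packing; the function names are my own):

```python
import math

def blocks_needed(file_bytes, block=4096):
    """Number of fixed-size blocks (clusters) a file occupies."""
    return math.ceil(file_bytes / block)

def slack_bytes(file_bytes, block=4096):
    """Space wasted in the final, partially filled block when the
    file system has no tail packing (as is the case with NTFS)."""
    return blocks_needed(file_bytes, block) * block - file_bytes

print(blocks_needed(2 * 1024 * 1024))  # 512 blocks for a 2 MB file
print(slack_bytes(579))                # 3517 bytes wasted
```

For a volume full of small files, that per-file slack is exactly the overhead tail packing was invented to reclaim.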

If I really had the time this list could go on, but I wanted to shed some light on one last point: volume mount points. In a Microsoft environment, drive-letter assignments are limited to the 26 letters of the English alphabet (NTFS can also mount volumes into empty folders, but drive letters remain the common practice). This is not the case on POSIX-like platforms, where a volume can be mounted onto any directory. In an enterprise environment there may be a need to handle many volumes, and things get problematic if a Microsoft server attempts to serve more than 26. In fact, once A, B and C are taken by default by the floppy devices and the root operating system mount, you have 23 letters left for CD/DVD, tape and other disk media devices.

Don’t get me wrong: just like any other file system, NTFS can be tuned for better results. You can get better performance out of any file system if you seek out the proper documentation, and that documentation is all over the internet.

With these limitations well known, why do we still deploy Microsoft Windows in environments it was not suited for? The answer is familiarity. Microsoft for the most part owns the client/end-user market, and with that the end user has grown too familiar and too comfortable with its platform. In turn, what was built for home (and to an extent small business) use has leaked into an environment it is not ready for. Please understand that I am not trying to preach against Microsoft or attack them. Like many others in the high performance server/storage industry, I have come to understand where certain problems originate, and that includes the limitations of the Windows platform. If you, the reader, feel differently about Microsoft and its role in enterprise class computing, please feel free to comment. I know I may not always be correct in my viewpoints, and if you can shed any additional light I would be very grateful.

Categories: File Systems, Linux, Microsoft, Storage, UNIX Tags:

The SUN has set and you can barely see it through WINDOWS.

December 22nd, 2008 5 comments

Lately I have been reading a lot of articles on the current happenings at Sun Microsystems. For years now I have had a soft spot for Sun, and despite all the struggles they have been through, deep down I can feel that their end is approaching. If not the end, then at least a complete remodel of the company to match its recent investments in technology.

Now before I continue, I wish to highlight a few statistics: recent reports have shown that, year after year, the Linux operating system has taken a firmer hold on the overall enterprise market. As of 2008 it held a share of at least 13.4% (according to recent IDC reports), while UNIX operating systems reported 32.7% of the same market. Combined (46.1%), Linux and UNIX outweigh Microsoft’s 36.5% share. I am filled with joy when I see Linux increase its share by an average of 10% every year. With leaders such as Red Hat, Novell and Canonical, I see an even stronger future for the Linux operating system.

Sun’s recent struggles have clearly shown the strong influence and adoption of open source software, and Sun has never had any trouble admitting this and conforming to it. While these statistics show the operating platforms within the enterprise industry, they give us no hint of the percentage of open source applications running on those platforms and within those architectures.

As of 2007 (coincidentally, after the announcements of ZFS, DTrace and OpenSolaris), Sun’s stock price and numbers have declined at a rapid rate. Revenue goals are not being met, and to help boost the usage of Sun products, more focus has been given to the open source community. Sun acquired MySQL and has been doing very well with the OpenOffice.org suite; ZFS and DTrace have been open sourced along with the Solaris operating system. In desperation, Sun began adopting open source so that it could remain an entity, if nothing else. As long as users run MySQL or OpenOffice.org there will still be a need for Sun Microsystems. But at what cost? The focus has shifted dramatically, with emphasis placed on the cheaper and more scalable Intel architecture as opposed to SPARC. In November of 2008 the company announced that another 5,000-6,000 positions would be cut. Is this a glimpse of the future of Sun? Are they reshaping the company for a new business model?

For the past couple of years their marketing has done nothing but push features. Sun has barely stayed afloat on this feature-driven promotion, and they have not held back from attacking others in the process. In the blogs of Sun employees I have seen the majority of attacks directed toward the Linux operating system, most of them misinformed cheap shots.

Why use anything else when you can use ZFS? Honestly, Linux has made great strides in providing enterprise class file systems and volume managers with an emphasis on high availability and performance, and it continues to do so. This includes device-mapper/LVM2 and the widely hyped btrfs. Note that I have even seen a project to port device-mapper to OpenSolaris. And DTrace? SystemTap is a worthy counterpart on the Linux platform.

How long will this strategy allow them to survive in what has apparently become the jungle of Linux and Windows? Windows itself suffers from many flaws that prevent it from being a true enterprise class solution. Its I/O subsystem suffers dramatically, both in raw performance and in how I/O requests are handled, and the NTFS file system is among the most poorly designed and least scalable file systems out there; it was designed for client usage. Given time, I still see the Linux operating system growing in the enterprise environment.

Categories: File Systems, UNIX Tags:


December 12th, 2008 Comments off

Hello all. My name is Petros Koutoupis, owner of Hydra Systems, LLC. This blog was started to keep everyone interested in my projects up to date on their status and details.

The reader will also get the pleasure of seeing me rant about random stuff relating to the technological world every now and then.

Categories: Misc Tags: