Linux 2.6 Kernel Storage Tuning Tips
There are certain topics that never cease to amaze me when I work closely with everyone from storage administrators to developers and QA engineers. Some of those topics are very specific to host-side storage tuning. On many occasions I have found that certain knowledge in the storage industry is never acknowledged or taught. Bad practices then develop, which can eventually lead to disastrous results. It becomes even worse on operating platforms that many may not be accustomed to, such as Linux and UNIX. This blog entry focuses on some SCSI Subsystem details for the Linux platform.
A Closer Look at the Linux 2.6 SCSI Layer:
In Linux, the SCSI Subsystem exists as a multi-layered interface divided into the Upper, Middle and Lower Layers. The Upper Layer consists of the device type drivers: the Disk Driver (sd), Tape Driver (st), CDROM Driver (sr) and Generic Driver (sg). The Middle Layer’s purpose is to connect the Upper and Lower Layers; in our case it is the scsi_mod.ko module. The Lower Layer holds the device drivers for the physical communication interfaces between the host’s SCSI Layer and the end target device. Here is where we will find the device driver for the HBA. Reference image below:
Whenever the Lower Layer detects a new SCSI device, it provides scsi_mod.ko with the appropriate host, bus (channel), target and LUN IDs. The type of media the device presents determines which Upper Layer driver is invoked. If you view /proc/scsi/scsi you can see each SCSI device’s type.
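As a rough illustration of the host, channel, target and LUN IDs and the media type described above, here is a minimal Python sketch that parses text in the layout /proc/scsi/scsi uses. The vendor and model strings in SAMPLE are made-up placeholders rather than real output, and the function name parse_proc_scsi is my own. The driver mapping only covers the two media types named in this entry.

```python
import re

# Sample text in the layout used by /proc/scsi/scsi.
# Vendor/model strings are placeholders, not output from a real host.
SAMPLE = """\
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: VENDOR_A Model: DISK_MODEL       Rev: 1.00
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: VENDOR_B Model: OPTICAL_MODEL    Rev: 2.00
  Type:   CD-ROM                           ANSI  SCSI revision: 05
"""

HOST_RE = re.compile(
    r"Host:\s*scsi(\d+)\s+Channel:\s*(\d+)\s+Id:\s*(\d+)\s+Lun:\s*(\d+)"
)
TYPE_RE = re.compile(r"Type:\s+(.+?)\s+ANSI")

# Upper Layer drivers for the two media types discussed in this entry.
UPPER_LAYER = {"Direct-Access": "sd_mod", "CD-ROM": "sr_mod"}


def parse_proc_scsi(text):
    """Return one dict per device with its (host, channel, target, lun) and type."""
    devices = []
    for line in text.splitlines():
        match = HOST_RE.search(line)
        if match:
            hctl = tuple(int(group) for group in match.groups())
            devices.append({"hctl": hctl, "type": None})
            continue
        match = TYPE_RE.search(line)
        if match and devices:
            # The Type line belongs to the most recent Host line.
            devices[-1]["type"] = match.group(1)
    return devices
```

On a live host you would pass the contents of /proc/scsi/scsi instead of SAMPLE, then look up each device's type in UPPER_LAYER to see which driver should claim it.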
The Direct-Access media type will utilize sd_mod.ko while the CD-ROM media type will utilize sr_mod.ko. Each respective driver will allocate an available major and minor number to each newly discovered and properly identified device, and on the 2.6 kernel udev will create an appropriate node name for each device. As an example, a Direct-Access device might be accessible through the /dev/sdb node name.
When a device is removed, the physical interface driver will detect it from the Lower Layer and pass the information back up to the Upper Layer.
There are multiple approaches to tuning a SCSI device. The more complex approach involves editing the source code and recompiling the device driver so that these variables are hard-coded for the lifetime of the driver(s) in use. That is not what we want. We want a more dynamic approach, something that can be customized on the fly. One day it may be optimal to configure a driver one way and the next day another.
Optimizing the Disk Device Variables:
The 2.6 Linux kernel introduced a new virtual file system, sysfs, mounted at /sys, to help reduce the clutter that had built up in /proc (for those not familiar with the traditional UNIX file system hierarchy, /proc was originally intended for process information). To summarize, /sys exposes all components registered with the Operating System’s kernel. That is, you will find block devices, networking ports, devices and drivers, etc. mapped from this location and easily accessible from user space for configuration. It is through /sys that we will be able to navigate to the disk device and fine-tune it for how we wish to use it. After I explain sysfs, I will move on to describing modules and how a module can be inserted with fine-tuned, pseudo-static parameters.
Let us assume that the disk device whose parameters we want to view, and possibly modify, is /dev/sda. You would navigate your way to /sys/block/sda. All device details for the device node named /dev/sda are stored in or linked from this point. If you go to the device subdirectory (/sys/block/sda/device) you can view timeout values, queue depth values, current states, vendor information and more.
To view a parameter value, simply open the file for a read (for example, /sys/block/sda/device/timeout).
In this example the timeout value for the SCSI device is 60 seconds. To modify the value, echo the new value into the file.
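The read and the echo described above can also be sketched in Python. This is only an illustration: the helper names are my own, and the sysfs_root argument exists so the functions can be pointed at a scratch directory instead of the real /sys. Writing to the real files requires root.

```python
from pathlib import Path


def device_attr_path(disk, attr, sysfs_root="/sys"):
    """Build the path to a per-device attribute, e.g. /sys/block/sda/device/timeout."""
    return Path(sysfs_root) / "block" / disk / "device" / attr


def read_attr(disk, attr, sysfs_root="/sys"):
    """Read a value, like: cat /sys/block/<disk>/device/<attr>"""
    return device_attr_path(disk, attr, sysfs_root).read_text().strip()


def write_attr(disk, attr, value, sysfs_root="/sys"):
    """Write a value, like: echo <value> > /sys/block/<disk>/device/<attr>"""
    device_attr_path(disk, attr, sysfs_root).write_text(f"{value}\n")
```

For instance, read_attr("sda", "timeout") returns the current timeout as a string, and write_attr("sda", "timeout", 90) would raise it to 90 seconds (90 is an arbitrary value picked for illustration).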
You can perform the same task for the queue depth of the device, along with the rest of the values. Unfortunately, values modified this way are not persistent. Every time the device mapping is refreshed (through a module removal/insertion, a bus scan, or a reboot) the values revert to their defaults. This can be both good and bad. A basic shell script can apply all of the desired values to all of the desired disk devices, so that the user does not have to enter each device path and modify everything one by one. On top of that script, a simple cron job can validate that the values are still in place and, if they are not, rerun the script.
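The text above suggests a shell script plus a cron job; the same idea is sketched here in Python instead. The values in DESIRED are hypothetical and should be chosen for your own storage, and the function name enforce and the "sd*" pattern are my own choices. Because the function only rewrites values that differ, it can be run repeatedly as the validation step.

```python
from pathlib import Path

# Hypothetical target values; pick real ones for your own storage.
DESIRED = {"timeout": "90", "queue_depth": "32"}


def enforce(desired, sysfs_root="/sys", pattern="sd*"):
    """Set each attribute on every matching disk; return the (disk, attr) pairs changed."""
    changed = []
    block_dir = Path(sysfs_root, "block")
    if not block_dir.is_dir():
        return changed
    for disk in sorted(block_dir.glob(pattern)):
        for attr, want in desired.items():
            path = disk / "device" / attr
            if not path.is_file():
                continue  # this device does not expose the attribute
            if path.read_text().strip() != str(want):
                path.write_text(f"{want}\n")
                changed.append((disk.name, attr))
    return changed
```

Run as root from a periodic job, enforce(DESIRED) returns an empty list when nothing has drifted, which makes it easy to log only the occasions when values had to be reapplied.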
Another way to modify values and have them pseudo-statically maintained is to pass those values to the module itself. For example, if you run modinfo on scsi_mod you will see the parameters it accepts, such as max_luns, listed on the terminal.
The appropriate way to enable a pseudo-static value is to insert the module with that parameter, for example "modprobe scsi_mod max_luns=255".
Alternatively, modify the /etc/modprobe.conf file (some platforms use /etc/modprobe.conf.local) by appending “options scsi_mod max_luns=255” and then reinsert the module. In both cases you must rebuild the initial RAM disk so that max_luns=255 is applied when scsi_mod is loaded at boot. This is what I meant by pseudo-static: the value only takes effect when it is supplied as the module is inserted, and it must be supplied on every insertion to stay assigned.
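Appending the options line by hand works, but repeated edits can leave duplicate or conflicting entries. As a small sketch, the function below (a name of my own) takes the text of a modprobe.conf-style file and returns it with the parameter set on the module's options line, adding the line if it is missing. It only transforms text; reading and writing the actual file is left to the caller.

```python
def set_module_option(conf_text, module, param, value):
    """Return conf_text with param=value present on the module's options line."""
    setting = f"{param}={value}"
    out = []
    found = False
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "options" and fields[1] == module:
            # Keep the module's other parameters, drop any old value of this one.
            kept = [f for f in fields[2:] if not f.startswith(param + "=")]
            line = " ".join(["options", module] + kept + [setting])
            found = True
        out.append(line)
    if not found:
        out.append(f"options {module} {setting}")
    return "\n".join(out) + "\n"
```

Calling set_module_option(text, "scsi_mod", "max_luns", 255) on the current file contents and writing the result back gives the same line described above; the module reinsertion and the RAM disk rebuild are still required afterwards.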
Some may now be asking: what the heck is a timeout value, and what does queue depth mean? A lot of resources with some pretty good information can easily be found on the Internet, but as far as basic explanations go, a SCSI timeout value is the maximum amount of time that an outstanding SCSI command has to complete on that SCSI device. So for instance, when scsi_mod initiates a SCSI command for the physical drive (the target) associated with /dev/sda with a timeout value of 60, the command has 60 seconds to complete; if it doesn’t, an ABORT sequence is issued to cancel the command.
The queue depth is a little more involved: it limits the total number of commands that can be outstanding for a device at a given point. If I have 64 outstanding SCSI commands that need to be issued to /dev/sda and my queue depth is set to 32, only 32 can be in flight at a time; the rest must wait, which limits my throughput and creates a bottleneck that slows down later transfers. On Linux, queue depth becomes a very hairy topic, primarily because it is not adjusted only in the block device parameters but is also defined in the Lower Layer of the SCSI Subsystem, where the HBA throttles I/O with its own queue depth values. This will be briefly explained in the next section.
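The 64-versus-32 example can be put into numbers with a deliberately simplified model. It assumes commands are dispatched in full batches and that, when an HBA limit is given, the smaller of the two depths wins. Real hosts refill the queue as individual commands complete, so treat this as back-of-the-envelope arithmetic only. The function name is my own.

```python
import math


def dispatch_rounds(outstanding, device_depth, hba_depth=None):
    """Number of full batches needed to issue `outstanding` commands.

    Simplified model: the effective depth is the device queue depth,
    capped by the HBA depth when one is supplied.
    """
    depth = device_depth if hba_depth is None else min(device_depth, hba_depth)
    return math.ceil(outstanding / depth)
```

Under this model dispatch_rounds(64, 32) gives 2 batches, and a hypothetical HBA limit of 16 (dispatch_rounds(64, 32, hba_depth=16)) doubles that to 4, which is why the device value alone does not tell the whole story.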
Other limitations can be seen on the storage end. The storage controller(s) can handle only so many service requests and in most cases it may be forced to begin issuing ABORTs for anything above its limit. In turn the transfers may be retried from the host side and complete successfully, so a lot of this may not be that apparent to the storage administrator. It becomes necessary to familiarize oneself with these terms when dealing with mass storage devices.
Optimizing the Host Bus Adapter:
An HBA can be optimized in pretty much the same fashion as the SCSI device, although it is worth noting that the parameters that can be adjusted on an HBA are vendor specific. These include additional timeout values, queue depth values, port down retry counts, etc. Some HBAs also come with volume and/or path management capabilities. Identify the module name for the device by running lsmod or by traversing /sys/class/scsi_host (it may also be useful to first locate the HBA, which is usually attached to your PCI bus, by running lspci). From that point you should be able to either navigate the /sys/module path or list all of the module's parameters with modinfo.
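The traversal of /sys/class/scsi_host and /sys/module described above can be sketched as follows. The function names are my own, and the sketch assumes each host directory exposes a proc_name file holding the driver name, which often but not always matches the module name; check it against lsmod before relying on it. As before, sysfs_root is only there so the code can be tried against a scratch directory.

```python
from pathlib import Path


def host_drivers(sysfs_root="/sys"):
    """Map each SCSI host (host0, host1, ...) to the driver name it reports."""
    drivers = {}
    host_dir = Path(sysfs_root, "class", "scsi_host")
    if not host_dir.is_dir():
        return drivers
    for host in sorted(host_dir.glob("host*")):
        name_file = host / "proc_name"
        if name_file.is_file():
            drivers[host.name] = name_file.read_text().strip()
    return drivers


def module_parameters(module, sysfs_root="/sys"):
    """Return the current parameter values under /sys/module/<module>/parameters."""
    params = {}
    param_dir = Path(sysfs_root, "module", module, "parameters")
    if not param_dir.is_dir():
        return params
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            params[entry.name] = None  # some parameters are not readable
    return params
```

Feeding each driver name from host_drivers() into module_parameters() gives a quick view of the values the HBA driver is currently running with, while modinfo remains the place to find what each vendor-specific parameter means.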
Additional Topics to be Aware of:
First and foremost, depending on the method of connection between host(s) and target(s), load balancing becomes a big problem. How do you balance the load across all Logical Units (LU), making sure that each can be serviced within an appropriate time frame? Fortunately, there exist some great tools for the Linux platform that many use for volume management, multipathing/load balancing and failover/failback needs. One such tool is device-mapper used in conjunction with multipath-tools. I have always used this in the past and it has served me extremely well, but be aware that this set of modules must also be fine-tuned to accommodate the type of storage you are using.
It also becomes quite necessary to understand file system basics, that is, basic structures and methodologies. Each file system is unique and can offer many positives and/or negatives in a production environment. Some things to consider with the file system are journaling methods, data allocation (block-based, or extent-based on top of a B+ tree), write barriers and more. With regard to journaling methods, some file systems offer more than one; some methods are more reliable than others, but they come at the cost of significant performance drops.
In order to fully and appropriately optimize all of these variables, the administrator must fully understand the I/O profile they are catering to. What limitations is our storage leaving us? Would I need to increase my SCSI timeout values in order to make sure that all I/O requests are fully serviced with little or no problems? What limitations does my HBA give me? How is the host accessing the end storage device (directly or in a SAN)? What about redundancy (path failovers), to make sure that there will be little to no downtime on failures? How much traffic should I expect? How does the application work with the disk device(s)? What performance gains or losses do I obtain with Volume Management? These are just a few of the many pressing questions that administrators must ask themselves when configuring and placing storage into production.
Unfortunately, this covers only a fraction of what needs to be known to manage storage systems. Not too long ago I wrote and published a PDF covering much more, including the I/O subsystem, performance and development tips; please reference this PDF. Also, here is a link to an excellent IBM article discussing the Linux SCSI Subsystem.