I am writing to announce the updated release of my Linux RAM disk kernel module, RapidDisk (rxdsk). It is currently at a stable 1.2b release, with optimizations to the configuration of each rxd block device's request queue, and with added checks so that the module builds on kernels from 2.6.32 all the way to the latest (currently 3.0.3); this includes addressing the blk_queue_ordered() command deprecated in 2.6.37 and later. More information can be found at http://rxdsk.petroskoutoupis.com.
Last week, I came across a tutorial about tweaking a specific parameter in the Linux virtual memory subsystem. So I figured that I would share all of the optimizations that I usually go through in a new installation of Linux.
Adjusting swap parameters
As the tutorial highlighted, lowering the vm.swappiness value is a good start. Realistically, I do not understand why it defaults to 60. The lower the value, the more memory is used before swapping processes to disk begins; at 0, all memory is used before swapping begins. Nowadays most PCs come with more than enough memory, so why the need to swap so early? Setting a value such as 10 on a system containing 1 GB or more of RAM should be fairly reasonable. I can only imagine the performance hit of swap write operations on SSDs, given the time it takes to modify each block of each page with the read-modify-erase-rewrite mechanism SSDs utilize when writing data to NAND cells. Swapping can really bring down the performance of such a high-speed technology while also hurting the limited cell life, despite modern wear-leveling algorithms.
While the highlighted article states pretty much the same thing, you just need to modify the /etc/sysctl.conf file and append vm.swappiness=10. The change takes effect on reboot, when sysctl is launched during init, but you can apply it immediately by typing the following at the command line:
$ sudo sysctl -q -p
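To confirm the new value took hold, you can read it straight back out of procfs. A minimal sketch, assuming the standard /proc/sys/vm/swappiness path; the helper name and its optional file argument (which lets it be pointed at test data) are my own:

```shell
#!/bin/sh
# Read the current swappiness value. Takes an optional file argument so the
# helper can be pointed at a test file instead of the live procfs entry.
swappiness() {
    cat "${1:-/proc/sys/vm/swappiness}"
}

# On a live system, this prints the active value (e.g. 10 after the change):
[ -r /proc/sys/vm/swappiness ] && swappiness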
Another thing I usually do on personal computing PCs is mount all file systems (that support it) with atime disabled. Atime is a file's last-accessed time. In most cases this is an unnecessary update to the metadata associated with a recently accessed file. Again, this benefits SSDs, as it means less data needs to be written to the storage device. Here is a good example of what my /etc/fstab file looks like:
UUID=5bc12928-9e8f-4413-9f20-6d5bcd107881  /      ext4  errors=remount-ro,noatime  0  1
/dev/sda1                                  /boot  ext4  noatime                    0  2
UUID=50f38470-810a-4145-ab0a-5e3152ced335  /usr   ext4  noatime                    0  2
Under the options column I know that I do not care about access times, so there is never a need to constantly update that metadata for each file touched, which would normally result in increased hard drive usage or SSD cell wear.
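You do not need to reboot for the new options to apply; a remount is enough, and you can verify the active options from /proc/mounts. A sketch under those assumptions; the has_noatime helper is my own, and its optional mounts-file argument exists only so it can be exercised against test data:

```shell
#!/bin/sh
# Check whether a given mount point carries the noatime option.
# Usage: has_noatime <mountpoint> [mounts-file]
has_noatime() {
    mp="$1"; mounts="${2:-/proc/mounts}"
    awk -v mp="$mp" '$2 == mp { if ($4 ~ /(^|,)noatime(,|$)/) found = 1 }
                     END { exit !found }' "$mounts"
}

# Apply the fstab change to a live mount without rebooting:
#   sudo mount -o remount,noatime /
has_noatime / && echo "/ is mounted noatime" || echo "/ is not mounted noatime"
```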
Caching applications to RAM
One last optimization I like to configure is caching what I can to RAM. A great example can be seen with Firefox. I use
tmpfs for this.
$ sudo mkdir /mnt/rdsk
$ sudo mount -t tmpfs -o size=96m tmpfs /mnt/rdsk/
These commands create a directory named /mnt/rdsk and then mount 96 MBytes of RAM there as volatile disk space. The reason I say volatile is that as soon as the file system is unmounted, or the PC is rebooted or powered down, all contents disappear. The data remains intact only as long as the file system is active. Although, who is to stop you from routinely backing the data up with rsync or some other archiving mechanism, and in turn restoring it when the system is back up and running?
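As a sketch of that backup-and-restore idea (the paths and helper name here are my own illustrations; rsync's -a flag preserves permissions and timestamps, and the code falls back to cp -a if rsync is not installed):

```shell
#!/bin/sh
# Copy the contents of a tmpfs mount to a backup directory on persistent
# storage so they can be restored after a reboot. Illustrative helper.
backup_tmpfs() {
    src="$1"; dst="$2"
    mkdir -p "$dst"
    if command -v rsync >/dev/null 2>&1; then
        rsync -a --delete "$src/" "$dst/"
    else
        cp -a "$src/." "$dst/"
    fi
}

# e.g. from cron, or a shutdown script:
#   backup_tmpfs /mnt/rdsk /var/backups/rdsk
# and on boot, swap the arguments to restore:
#   backup_tmpfs /var/backups/rdsk /mnt/rdsk
```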
Now why would you want to use something like this? Faster performance, as you do not have to rely on a slower disk device. Also, in some cases there is added security. For instance, if Firefox caches its data to this RAM-based file system and I shut down the PC, all of that cache, which may include confidential or private information, disappears. To set something like this up, you will need to modify your /etc/fstab file and append the following line:
tmpfs /mnt/rdsk tmpfs size=96m,nr_inodes=10k,mode=777 0 0
Please reference the man page for the mount command to learn what these options mean for tmpfs. With this, every time you reboot the PC, 96 MBytes of tmpfs space will be mounted at /mnt/rdsk.
If you want to cache Firefox to this tmpfs space, then open up the web browser, type about:config in the URL bar (BE CAREFUL HERE), and add a new string preference named browser.cache.disk.parent_directory with a value of /mnt/rdsk. Restart the browser and you will notice a performance boost.
Note that using tmpfs and ramfs does not have to be limited to Firefox caching. There are numerous applications which can take advantage of this. It is just up to you to identify and decide.
One way to think of all these optimizations is that they can also reduce power consumption. With less power spent routinely spinning magnetic disk drives up and down, it makes sense to limit access to these devices.
Linked off of www.linuxleak.com, today I found this interesting article on “Linux Tuning The VM (memory) Subsystem.” The author also offers some suggestions for a more efficient computing environment.
To those who are interested in the topic of Linux storage management: my article of the same name is planned for the 3/2009 issue of Linux+ magazine, hitting the shelves July through September. I do not know how 3/2009 equates to July-September, but that is what I have been told. It is a seven-page article and gets into some great detail on storage management.
There are certain topics that never cease to amaze me when I work closely with storage administrators, and even developers and QA engineers. Some of those topics are very specific to host-side storage tuning. That is, there have been many occasions when certain knowledge in the storage industry was never acknowledged or taught. Eventually bad practices develop, which can lead to disastrous results. It becomes even worse on operating platforms that many are not necessarily accustomed to, such as Linux and UNIX. This blog entry focuses on some SCSI subsystem details for the Linux platform.
A Closer Look at the Linux 2.6 SCSI Layer:
In Linux, the SCSI subsystem exists as a multi-layered interface divided into the Upper, Middle, and Lower layers. The Upper Layer consists of device-type identification modules (i.e., the disk driver (sd), tape driver (st), CD-ROM driver (sr), and generic driver (sg)). The Middle Layer's purpose is to connect the Upper and Lower Layers; in our case it is the scsi_mod.ko module. The Lower Layer holds the device drivers for the physical communication interfaces between the host's SCSI layer and the end target device; here is where we will find the device driver for the HBA. Reference the image below:
Whenever the Lower Layer detects a new SCSI device, it provides scsi_mod.ko with the appropriate host, bus (channel), target, and LUN IDs. The type of media a device presents determines which Upper Layer driver will be invoked. If you view /proc/scsi/scsi, you can see what each SCSI device's type is.
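For a quick look, you can cat /proc/scsi/scsi and pull out the Type field of each entry. A small sketch; the helper name and its optional file argument are mine, and the sample vendor/model strings used in testing it are purely illustrative:

```shell
#!/bin/sh
# Print the media type of each attached SCSI device, as reported in
# /proc/scsi/scsi. Accepts an alternate file so it can run on sample data.
scsi_types() {
    awk '$1 == "Type:" { print $2 }' "${1:-/proc/scsi/scsi}"
}

# On a live system this might print lines such as:
#   Direct-Access
#   CD-ROM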
The Direct-Access media type will utilize sd_mod.ko, while the CD-ROM media type will utilize sr_mod.ko. Each respective driver allocates an available major and minor number to each newly discovered and properly identified device, and on the 2.6 kernel, udev creates an appropriate node name for each device. As an example, a Direct-Access device will be accessible through a node name such as /dev/sdb.
When a device is removed, the physical interface driver in the Lower Layer detects it and passes the information back up to the Upper Layer.
There are multiple approaches to tuning a SCSI device. The more complex approach involves editing source code and recompiling the device driver so that these variables are hard-coded for the lifetime of the driver(s). That is not what we want; we want a more dynamic approach, something that can be customized on the fly. One day it may be optimal to configure a driver one way, and the next day another.
Optimizing the Disk Device Variables:
The 2.6 Linux kernel introduced a new virtual file system, sysfs, mounted at /sys, to help reduce the clutter that /proc had become (for those not familiar with the traditional UNIX file system hierarchy, /proc was originally intended for process information). To summarize, /sys contains all components registered with the operating system's kernel. That is, you will find block devices, networking ports, devices, drivers, etc. mapped from this location and easily accessible from user space for enhanced configuration. It is through /sys that we will be able to navigate to the disk device and fine-tune it to how we wish to utilize it. After I explain sysfs, I will move on to describing modules and how a module can be inserted with fine-tuned and pseudo-static parameters.
Let us assume that the disk device whose parameters we want to view, and possibly modify, is /dev/sda. You would navigate your way to /sys/block/sda; all details for the device node /dev/sda are stored or linked from this point. Under the device directory you can view timeout values, queue depth values, current states, vendor information, and more.
To view a parameter value, you can simply open the file for a read:
$ cat /sys/block/sda/device/timeout
60
Here we can see that the timeout value for the SCSI device is 60 seconds. To modify the value, you can echo a new one (say, 120 seconds) into it:
$ echo 120 | sudo tee /sys/block/sda/device/timeout
You can perform the same task for the queue depth of the device, along with the rest of the values. Unfortunately, modifying the disk device values this way is not maintained statically: every time the device mapping is refreshed (through a module removal/insertion, a bus scan, or a reboot), the values revert to their defaults. This can be both good and bad. A basic shell script can modify the values for all desired disk devices so that the user does not have to enter each device path and modify everything one by one. On top of that, a simple cron job can validate that the values are maintained and, if not, rerun the original script.
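Such a script might look like the following sketch. The timeout and queue depth values here are examples, not recommendations, and the optional sysfs-root argument is my own addition so the function can be exercised against a test directory tree rather than the live /sys:

```shell
#!/bin/sh
# Apply a SCSI timeout and queue depth to every sd* disk found under a
# sysfs tree. Run as root against the real /sys; re-run (e.g. from cron)
# after any rescan or module reload, since the values do not persist.
tune_scsi_disks() {
    timeout="$1"; qdepth="$2"; root="${3:-/sys}"
    for dev in "$root"/block/sd*; do
        [ -d "$dev/device" ] || continue
        echo "$timeout" > "$dev/device/timeout"
        echo "$qdepth"  > "$dev/device/queue_depth"
    done
}

# e.g.: tune_scsi_disks 120 64
```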
Another way to modify values, and have them pseudo-statically maintained, is to pass them to the module itself at insertion time. For example, if you run modinfo on scsi_mod, you will see its parameters dumped to the terminal screen.
The appropriate way to enable a pseudo-static value is to insert the module with that parameter:
$ sudo modprobe scsi_mod max_luns=255
Or modify the /etc/modprobe.conf file (some platforms use /etc/modprobe.conf.local) by appending "options scsi_mod max_luns=255", and then reinsert the module. In both cases you must rebuild the RAM disk so that when the host reboots, max_luns=255 is applied when scsi_mod is inserted. This is what I meant by pseudo-static: the value is maintained only while the module is inserted, and must always be defined at insertion to stay statically assigned.
Some may now be asking: well, what the heck is a timeout value, and what does queue depth mean? Plenty of resources with good information can easily be found on the Internet, but as a basic explanation, a SCSI timeout value is the maximum time an outstanding SCSI command has to complete on that SCSI device. So, for instance, when scsi_mod initiates a SCSI command for the physical drive (the target) associated with /dev/sda with a timeout value of 60, the command has 60 seconds to complete; if it does not, an ABORT sequence is issued to cancel it.
The queue depth gets a bit more involved: it limits the total number of transfers that can be outstanding for a device at a given point. If I have 64 outstanding SCSI commands that need to be issued to /dev/sda and my queue depth is set to 32, only 32 can be serviced at a time, limiting my throughput and creating a bottleneck that slows down future transfers. On Linux, queue depth becomes a very hairy topic, primarily because it is not adjusted only in the block device parameters but is also defined in the Lower Layer of the SCSI subsystem, where the HBA throttles I/O with its own queue depth values. This is briefly explained in the next section.
Other limitations can be seen on the storage end. The storage controller(s) can handle only so many service requests and in most cases it may be forced to begin issuing ABORTs for anything above its limit. In turn the transfers may be retried from the host side and complete successfully, so a lot of this may not be that apparent to the storage administrator. It becomes necessary to familiarize oneself with these terms when dealing with mass storage devices.
Optimizing the Host Bus Adapter:
An HBA can also be optimized in much the same fashion as the SCSI device, although it is worth noting that the parameters that can be adjusted on an HBA are vendor specific. These include additional timeout values, queue depth values, port-down retry counts, etc. Some HBAs also come with volume and/or path management capabilities. Simply identify the module name for the device with lsmod, or traverse /sys/class/scsi_host (it may also be useful to first identify the adapter, usually attached to your PCI bus, by executing lspci). From that point you should be able to either navigate the /sys/module path or list all module parameters for that device with modinfo.
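As a sketch of that identification step: each /sys/class/scsi_host/hostN directory exposes a proc_name attribute naming the Lower Layer driver that registered it. The helper name and its optional root argument (which lets the function run against a test tree) are my own:

```shell
#!/bin/sh
# Print each SCSI host number alongside the Lower Layer (HBA) driver that
# registered it, by reading the proc_name attribute under sysfs.
hba_drivers() {
    root="${1:-/sys}"
    for host in "$root"/class/scsi_host/host*; do
        [ -f "$host/proc_name" ] || continue
        printf '%s: %s\n' "$(basename "$host")" "$(cat "$host/proc_name")"
    done
}

# e.g.: hba_drivers
# might print "host0: ahci" or "host1: qla2xxx"; modinfo on that driver
# name then lists the vendor-specific parameters you can tune.
```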
Additional Topics to be Aware of:
First and foremost, depending on the method of connection between host(s) and target(s), load balancing becomes a big problem. How do you balance the load across all Logical Units (LUs), making sure that all can be serviced within an appropriate time frame? Fortunately, there exist some great tools on the Linux platform that many utilize for volume management, multipathing/load balancing, and failover/failback needs. One such tool is device-mapper used in conjunction with multipath-tools. I have always used this in the past and it has served me extremely well, but be aware that this set of modules must also be fine-tuned to accommodate the type of storage you are utilizing.
It also becomes quite necessary to understand file system basics, that is, basic structures and methodologies. Each file system is unique and can offer many positives and/or negatives in a production environment. Some things to consider with the file system are journaling methods, data allocation (block-based, or extent-based on top of a B+ tree), write barriers, and more. In regard to journaling, some file systems offer more than one method, some more reliable than others, but the more reliable ones come at the cost of significant performance drops.
In order to fully and appropriately optimize all of these variables, the administrator must fully understand the I/O profile they are catering to. What limitations does our storage leave us with? Would I need to increase my SCSI timeout values to make sure that all I/O requests are fully serviced with little or no problem? What limitations does my HBA impose? How does the host access the end storage device (directly, or over a SAN)? What about redundancy (path failover), to make sure there will be little to no downtime on failures? How much traffic should I expect? How do the applications work with the disk device(s)? What performance gains or losses do I get with volume management? These are just a few of the many pressing questions an administrator must ask when configuring and placing storage into production.
Unfortunately, this covers only a fraction of what needs to be known to manage storage systems. Not too long ago I wrote and published a PDF covering much more, including the I/O subsystem and performance and development tips; please reference this PDF. Also, here is a link to an excellent IBM article discussing the Linux SCSI subsystem.