Kernel Log - Coming in 3.7 (Part 1): Filesystems & storage
by Thorsten Leemhuis
Linux 3.7 introduces a range of Btrfs performance improvements. The kernel now supports version 2 of the SMB data exchange protocol, which recent Windows versions use, and it offers discard functionality for software RAIDs, which is important for SSDs.
Last weekend, Linus Torvalds released the fifth release candidate for Linux 3.7; he was happy to point out that only a few, mostly minor, changes were submitted for this RC. As usual, Torvalds and his fellow developers had already incorporated all major new features at the start of the Linux 3.7 development cycle. Since it is rare that further major changes are integrated during the stabilisation phase, the Kernel Log can already provide a comprehensive overview of the most important advancements in the new Linux version, which is expected to arrive in early to mid-December.
The overview will be presented in the customary series of articles that will cover the various Linux areas. The first article below describes the most important new features in the kernel's filesystems and storage hardware support; subsequent articles will discuss the kernel's graphics drivers, network support, architecture code and other hardware drivers.
Btrfs
According to the developers, an optimisation for the Btrfs fsync code improves write performance, especially that of virtual machines, when VM images are located on Btrfs filesystems and the guest's software frequently requests immediate data writes via fsync. Commenting on the modification, the developer, who works for Fusion-io, noted that in Fio benchmark tests with randomly distributed writes that each ended in an fsync, data throughput with a SATA drive increased from 82 to 140KB/s. The numbers are quite low because, after each of the small random writes, the benchmark waited until the data had actually been written. With an unspecified "Fusion drive" (likely a Fusion-io ioDrive, a PCIe device with flash memory chips), throughput reportedly increased from 431 to 2,532KB/s.
This modification is the basis for another change that improves fsync performance with synchronous writes: in tests the developer ran with the "dd" program, the throughput of a SATA drive increased from 104 to 121KB/s; on a ramdisk, Btrfs apparently completed the test many times faster than before. An fsync code modification for Btrfs that was introduced in Linux 3.5 has been reverted because sysbench results showed that it had caused performance to drop from 39 to 24MB/s on the developer's test system.
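To illustrate why throughput figures for this kind of test are measured in kilobytes rather than megabytes per second, the following minimal Python sketch mimics the write pattern described above: small writes at random offsets, each followed by an fsync. It is not the developer's Fio job; the function name, file size, block size and write count are assumptions chosen purely for illustration.

```python
import os
import random
import time


def fsync_write_benchmark(path, file_size=1 << 20, block_size=4096,
                          writes=64, seed=0):
    """Write `writes` blocks at random offsets, calling fsync after each one.

    Returns the observed throughput in KB/s. All parameters are
    illustrative defaults, not the values used in the tests quoted above.
    """
    rng = random.Random(seed)
    block = b"\xa5" * block_size
    blocks_in_file = file_size // block_size

    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, file_size)          # give the file its final size
        start = time.perf_counter()
        for _ in range(writes):
            offset = rng.randrange(blocks_in_file) * block_size
            os.pwrite(fd, block, offset)     # small random write
            os.fsync(fd)                     # wait until it is on stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)

    written_kb = writes * block_size / 1024
    return written_kb / elapsed if elapsed > 0 else float("inf")
```

Calling `fsync_write_benchmark("/path/on/btrfs/testfile")` on different filesystems or kernels gives a rough feel for how expensive each fsync is; a dedicated tool such as Fio remains the better choice for real measurements.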
Like Ext4 and other filesystems before it, Btrfs can now deallocate storage areas within files. This "hole punching" technique is of interest, for example, for virtualisation software because it allows the host's filesystem to deallocate space when the files that used that space have been deleted in the guest. In addition, there are quite a few bugfixes to the send/receive code that was introduced in Linux 3.6. With Btrfs in Linux 3.7, it is now possible to have not just 20, but up to 65,536 hardlinks to one file. Chris Mason, the maintainer of this still experimental filesystem, lists various other changes to Btrfs in the email for his main Git pull request.
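For readers who want to see what hole punching looks like from an application's point of view, here is a minimal sketch that calls the Linux fallocate() system call through Python's ctypes module. It assumes a 64-bit Linux system and a filesystem that supports hole punching; the helper name punch_hole is hypothetical, and the two flag values are the ones defined in the kernel's linux/falloc.h header.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Load the C library of the running process (assumption: Linux, 64-bit off_t)
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_longlong, ctypes.c_longlong]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate `length` bytes at `offset` without changing the file size.

    Raises OSError (for example with EOPNOTSUPP) if the filesystem
    cannot punch holes.
    """
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful call, the punched range reads back as zeros and the file's apparent size stays the same, while the space it occupied is returned to the filesystem.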
Filesystems
- Ext4 can now resize volumes that are larger than 16TB. Theodore "tytso" Ts'o also notes in the email for his main Git pull request that resize operations have generally become faster.
- The CIFS (Common Internet File System) filesystem that gives access to Windows and Samba shares now supports SMB (Server Message Block) 2.0, which was introduced with Windows Vista, as well as its Windows 7 descendant, SMB 2.1. The code is still classified as experimental; some code fragments have been part of the kernel for some time, but were marked as "broken" and were, therefore, normally unusable.
- The NFS 4.1 support is no longer classified as experimental.
Storage
- The MD software RAID code of Linux 3.7 can now use discard to inform the devices in a RAID array of newly deallocated storage areas, which is relevant for SSDs and thin provisioning. The NBD (Network Block Device) code can now also communicate deallocated storage areas via discard.
- In ATA devices, the cache_type sysfs device file (for example /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/cache_type) can now be used to switch between the write-through and write-back cache behaviours.
- The Libata subsystem supports "Aggressive SATA device sleep", a power-saving mechanism that is specified in the AHCI 1.3.1 Technical Proposal and can reduce power consumption in systems with SATA disks.
- The qla4xxx SCSI driver can now handle the QLogic 8032 (ISP83XX), and Virtio-Scsi supports the resizing of storage devices.
- The block layer offers the "WRITE SAME" command that allows a data packet to be transmitted once and then written to all specified IO blocks. This provides an easy and efficient way to perform tasks such as initialising RAIDs or overwriting entire storage devices.
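As an illustration of the cache_type item in the list above, the following Python sketch reads and changes the cache behaviour through the sysfs file. The helper names are hypothetical, and the strings "write through" and "write back" are assumed to be the values the kernel's SCSI disk driver accepts; it is worth reading the file first to see which value the kernel currently reports. Writing to the real sysfs file requires root privileges.

```python
# The two behaviours named in the article; the exact strings are an
# assumption based on the SCSI disk driver's cache_type attribute.
CACHE_MODES = ("write through", "write back")


def get_cache_type(sysfs_file):
    """Return the cache behaviour currently reported by the kernel."""
    with open(sysfs_file) as f:
        return f.read().strip()


def set_cache_type(sysfs_file, mode):
    """Switch the drive between write-through and write-back caching."""
    if mode not in CACHE_MODES:
        raise ValueError("unsupported cache mode: %r" % (mode,))
    with open(sysfs_file, "w") as f:
        f.write(mode + "\n")
```

With the example path from the list, `set_cache_type(path, "write through")` would ask the drive to stop acknowledging writes before they have reached the medium, which trades some performance for safety in the event of a power failure.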