Kernel Log: Coming in 2.6.34 (Part 2) - File Systems
by Thorsten Leemhuis
Version 2.6.34 of the Linux kernel will be the first to support the Ceph and LogFS file systems. A number of changes to the Btrfs and XFS code promise improved performance. The kernel should now be better at working with drives with 4 KB logical sectors.
On Tuesday morning, Linus Torvalds released the fifth pre-release version of Linux 2.6.34. One feature highlighted in Torvalds' release e-mail was a fix for a problem in the ACPI subsystem which had afflicted several testers.
Following the teething troubles at the start of the 2.6.34 development cycle, RC5 arrived around one week after RC4, as is usual at this stage of development. RC4 had been delayed as Torvalds and other developers spent days hunting down a bug in the kernel's memory management code. This gave rise both to a lengthy discussion and to fixes for several other bugs uncovered during the search. Details of what went on and an in-depth look at parts of the virtual memory subsystem can be found in the article "The case of the overly anonymous anon_vma" on LWN.net.
In tandem with the ongoing development of Linux 2.6.34, Kernel Log will continue to report on major changes in the new kernel version, which is scheduled for release in May. Following on from part one of the 'What's coming in 2.6.34' series, which dealt with networking-related changes, part two deals with changes relating to file systems. This will be followed in the coming weeks by articles on graphics support, architecture code, drivers and a number of other areas.
Two new file systems
The inclusion of Ceph and LogFS in 2.6.34 further expands the number of file systems supported by the Linux kernel.
Ceph (git pull request) is an experimental, distributed, replicating network file system for clusters; it is licensed under the LGPL. According to the developers, it is suitable for managing data volumes in the petabyte range "and beyond", is already pretty stable, and offers numerous features missing from comparable open source file systems. These include the ability to expand the file system simply by adding servers, with Ceph automatically distributing data across the new servers. It also aims to increase data throughput by automatically redistributing data. Although it uses parts of the still experimental Btrfs file system code for storing data, Ceph, which started out as a research project at the Storage Systems Research Center of the University of California, Santa Cruz, should already be usable, though the developers do strongly advise users to back up important data.
The article "Ceph: The Distributed File System Creature from the Object Lagoon" in Linux Magazine (a US publication), by Dell employee and high performance computing specialist Jeffrey B. Layton, offers a good overview of Ceph. More information on the file system can be found in the brief description of Ceph integrated into the file system code (distributed over more than 200 commits) and an old article on LWN.net, which describes an earlier Fuse-based version of the file system.
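The idea of growing a cluster by adding servers, with data spreading onto the newcomers automatically, can be illustrated with a small sketch. The code below is not Ceph's placement logic, which is considerably more elaborate; it substitutes plain rendezvous (highest-random-weight) hashing as a stand-in, and the server names, object names and helper functions are all hypothetical.

```python
import hashlib

def score(server: str, obj: str) -> int:
    """Deterministic pseudo-random weight for a (server, object) pair."""
    digest = hashlib.sha256(f"{server}:{obj}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj: str, servers: list[str]) -> str:
    """Rendezvous hashing: the server with the highest weight stores the object."""
    return max(servers, key=lambda s: score(s, obj))

# Hypothetical cluster and object names, for illustration only.
objects = [f"object-{i}" for i in range(1000)]
old_cluster = ["s1", "s2", "s3"]
new_cluster = old_cluster + ["s4"]  # expand by adding one server

before = {o: place(o, old_cluster) for o in objects}
after = {o: place(o, new_cluster) for o in objects}

# Only objects that now belong on the new server change location.
moved = [o for o in objects if before[o] != after[o]]
print(f"{len(moved)} of {len(objects)} objects moved to the new server")
```

With this scheme, an object only ever moves to the newly added server, so existing servers never need to exchange data among themselves when the cluster grows.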
The second new file system is again something with which the majority of users are unlikely to come into direct contact in the near future: LogFS uses log structures and is of primary interest for flash storage with no wear levelling of its own, as used in embedded systems. Roughly speaking, the file system, which was largely developed by German developer Jörn Engel, does much the same job as the firmware in solid state disks (SSDs) with SATA connectors. It is, however, unlikely to be useful for desktop SSDs, as the drive's Flash Translation Layer (FTL) and the file system would get in each other's way. Background information on LogFS can be found in an LWN.net article from 2007 and in the LogFS documentation in the kernel source code.
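To give a rough feel for why a log-structured layout suits flash without wear levelling, here is a toy sketch. It does not reflect the actual LogFS on-disk format or algorithms; the class `ToyLogStore` and its fields are hypothetical, and the "flash" is just a Python list. The point is only that updates are written to the next free block rather than overwriting in place, so repeated writes to the same file spread evenly over all blocks.

```python
class ToyLogStore:
    """Toy log-structured store: every write goes to the next free block."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [None] * num_blocks  # simulated flash erase blocks
        self.writes = [0] * num_blocks     # per-block write counter (wear)
        self.index = {}                    # name -> block holding the newest copy
        self.head = 0                      # where the log continues

    def write(self, name: str, data) -> None:
        live = set(self.index.values())
        live.discard(self.index.get(name))  # the copy being replaced is reusable
        n = len(self.blocks)
        # Scan forward from the log head for a block with no live data.
        for step in range(n):
            pos = (self.head + step) % n
            if pos not in live:
                break
        else:
            raise RuntimeError("no free block left")
        self.blocks[pos] = data
        self.writes[pos] += 1
        self.index[name] = pos             # the previous copy is now stale
        self.head = (pos + 1) % n

    def read(self, name: str):
        return self.blocks[self.index[name]]


store = ToyLogStore(num_blocks=4)
for version in range(8):
    store.write("config", version)  # rewrite the same file eight times

print(store.read("config"))  # 7
print(store.writes)          # [2, 2, 2, 2]
```

An in-place design would have written the same block eight times; here each of the four blocks is written twice.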
Tuning
Btrfs maintainer Chris Mason briefly outlines some of the major changes in Btrfs (which were not merged into the main development tree until after the merge window had closed) in his git pull request. In future it will be possible to specify which subvolume is mounted by default when none is explicitly named. This should be useful for distributions which generate a Btrfs snapshot before installing updates, in order to allow users to roll back to a previous state in the event of problems. Red Hat developers are working on a similar function for Fedora 13. A rewrite of the defrag code in Btrfs not only fixes various bugs but also allows compression of selected files on otherwise uncompressed volumes. It is also now possible to defragment parts of a file. There have been additional enhancements to allow updated files to be located more quickly.
Various changes aimed at adding support for LZMA compression to SquashFS (used for Live CDs) proved not to be to Torvalds' taste. They will now be rewritten, but this could take some time. Some of the underlying functions for LZMA and LZO compression and several other changes to SquashFS were, however, merged into the main development tree.
Various enhancements to the XFS code should increase data throughput for some tasks. Details of these and other changes can be found in the February and March XFS status updates. Exofs (Extended Object File System), a file system designed for OSDs (object-based storage devices) which was merged into 2.6.30, now supports RAID 0, with RAID 5 and 6 in the pipeline.
The NILFS2 file system can now send discard commands, which enables it, for example, to tell SSDs about free areas. A change to the partitioning code should allow the Linux kernel to work with hard drives whose physical and logical sector sizes are both 4 KB. According to both commit comments and a recent discussion on LKML, Western Digital is planning to introduce just such a hard drive in order to get around the 2 terabyte partition limit on hard drives partitioned using MBR. The limit arises because MBR records sector addresses as 32-bit values, so larger logical sectors raise the maximum partition size.
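The arithmetic behind the 2 terabyte limit is easy to check. An MBR partition entry stores the sector count as a 32-bit value, so the largest partition is 2^32 sectors times the logical sector size. The short sketch below works this out for 512-byte and 4 KB logical sectors; the function name is purely illustrative.

```python
MBR_SECTOR_FIELD_BITS = 32  # MBR partition entries use 32-bit sector values

def mbr_max_partition_bytes(logical_sector_size: int) -> int:
    """Largest partition size a 32-bit sector count can describe."""
    return (2 ** MBR_SECTOR_FIELD_BITS) * logical_sector_size

for sector_size in (512, 4096):
    tib = mbr_max_partition_bytes(sector_size) / 2 ** 40
    print(f"{sector_size}-byte sectors: {tib:.0f} TiB")

# 512-byte sectors: 2 TiB
# 4096-byte sectors: 16 TiB
```

Moving from 512-byte to 4 KB logical sectors therefore raises the ceiling eightfold without any change to the MBR format itself.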
Minor gems
Many minor, but by no means insignificant, changes can be found in the list of commit headers below. The headers, as well as many of the links above, link to the web interface for the git tree containing the kernel source code on kernel.org, which is maintained by Linus Torvalds. Commit comments and patches themselves generally provide extensive additional information on changes.
Btrfs
- Btrfs: add a "df" ioctl for btrfs
- Btrfs: add ioctl and incompat flag to set the default mount subvol
- Btrfs: add search and inode lookup ioctls
- Btrfs: cache the extent state everywhere we possibly can V2
- Btrfs: kill max_extent mount option
- Btrfs: make df be a little bit more understandable
- Btrfs: make subvolid=0 mount the original default root
- Btrfs: run the backing dev more often in the submit_bio helper
Ext-Family
- ext4: Add new tracepoint for jbd2_cleanup_journal_tail
- ext4: Add new tracepoints to debug delayed allocation space functions
- ext4: deprecate obsoleted mount options
Others
- 9P2010.L handshake: Add mount option
- 9p: documentation update
- add several pieces to shared subtree documentation
- CIFS: Add mmap for direct, nobrl cifs mount types
- doc: add the documentation for mpol=local
- Documentation/fs/: split txt and source files
- exofs: Define on-disk per-inode optional layout attribute
- exofs: groups support
- fs/9p: Add hardlink support to .u extension
- FS-Cache: Remove the EXPERIMENTAL flag
- net/9p: Add multi channel support.
- nfsd: 4.1 has an rfc number
- ocfs2_dlmfs: Add capabilities parameter.
- ocfs2/userdlm: Add tracing in userdlm
- quota: split out compat_sys_quotactl support from quota.c
- reiserfs: properly honor read-only devices
- Remove EXPERIMENTAL from NFS_FSCACHE
- Squashfs: add a decompressor framework
- Squashfs: add decompressor entries for lzma and lzo
- xfs: Add trace points for per-ag refcount debugging.
- xfs: add tracing to xfs_swap_extents
- xfs: implement optimized fdatasync
- xfs: Non-blocking inode locking in IO completion
- xfs: Use delayed write for inodes rather than async V2
For other articles on 2.6.34 and links to the rest of the "Coming in 2.6.34" series, see The H's Kernel Log - 2.6.34 Tracking page. Older Kernel Logs can be found in the archives or by using the search function at The H Open Source. New editions of Kernel Logs are also mentioned on Identi.ca and Twitter via "@kernellog2". The Kernel Log author also posts updates about various topics on Identi.ca and Twitter via "@kernellogauthor".