Tuning the Linux file system Ext3
Oliver Diedrich
Ext3 is the standard file system for Linux: It is robust, fast and suitable for all fields of use. And yet Ext3 can become a performance bottleneck. Even fragmentation is an issue with Ext3.
Like its predecessor Ext2, Ext3 has established itself as the standard file system for Linux: Ext2/Ext3 offers good performance and, thanks to its widespread use, has been tested more extensively than any other Linux file system. In addition, it is so robust that most data can be rescued even from a partially corrupted hard disk. This distinguishes Ext2/Ext3 from other file systems such as ReiserFS, whose delicate on-disk management structures can result in total data loss after only a few defective sectors.
In addition to its robust metadata structures, Ext3 offers a well-engineered fsck tool: e2fsck can rescue most of the data from damaged file systems – we will reveal a few tricks for making the most of the tool even in particularly dire circumstances later on in this article. If e2fsck gives up, the file system is so corrupted that data can only be retrieved sector by sector with a low-level data rescue tool like dd_rescue and then assigned to files manually.
When we refer to Ext3 in this article, we are also referring to Ext2. The main difference between them is that Ext3 has a journal, which keeps the file system consistent at all times and reduces the time needed to check a file system after a crash from several hours to a few seconds. Ext2 and Ext3 are fully compatible: an Ext3 file system can be mounted as Ext2, in which case its journal is simply not used, and the command tune2fs -j adds a journal to an existing Ext2 file system.
Simple
Like all decent Unix file systems, Ext3 uses three general data structures: directories, inodes and data blocks. Directories only contain file names and the inode numbers assigned to them. Several directory entries can point to one inode; this is what is called a hard link. A soft or symbolic link, by contrast, is implemented as a file whose contents name another file, rather than as a directory entry pointing at that file's inode. On the hard disk, directories themselves are stored as files; they differ from regular files only in their file type and in the fact that their contents follow a prescribed structure.
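The difference between the two kinds of link can be observed from user space. The following is a minimal sketch, assuming a Linux system and a temporary directory on a file system that supports hard links; the function name link_demo and the file names are made up for illustration.

```python
import os
import tempfile


def link_demo():
    """Create a file, a hard link and a symbolic link and report what they share."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, hard)     # second directory entry for the same inode
        os.symlink(original, soft)  # separate file whose content is a path
        return {
            "original_inode": os.stat(original).st_ino,
            "hard_inode": os.stat(hard).st_ino,
            # lstat inspects the link itself instead of following it
            "soft_inode": os.lstat(soft).st_ino,
            "link_count": os.stat(original).st_nlink,
            "soft_target": os.readlink(soft),
        }


if __name__ == "__main__":
    for key, value in link_demo().items():
        print(key, value)
```

The hard link reports the same inode number as the original and raises its link count to two, while the symbolic link has an inode of its own and merely stores the path of the original.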
Inodes contain all the required file details apart from the name: size, file type (regular file, directory, device file, pipe, socket or symbolic link), owner, number of hard links, access privileges and times, and the numbers of the data blocks containing the data. These details, apart from the data block numbers, can be retrieved with the stat tool, and many of them also with ls. The ls option -i returns a file's inode number.
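The same inode details are available to programs through the stat system call. As an illustration, this short sketch collects them with Python's os.lstat; the helper name describe and the choice of fields are assumptions made for the example, not part of any Ext3 tool.

```python
import os
import stat


def describe(path):
    """Return the inode details of path that stat exposes (the name is not among them)."""
    st = os.lstat(path)  # lstat so that a symbolic link is described, not its target
    return {
        "inode": st.st_ino,
        "mode": stat.filemode(st.st_mode),  # file type and access privileges, as in ls -l
        "links": st.st_nlink,               # number of hard links
        "owner_uid": st.st_uid,
        "size": st.st_size,
        "mtime": st.st_mtime,               # time of last modification
    }


if __name__ == "__main__":
    print(describe("."))
```

Note that the returned data contains neither the file name, which lives in the directory, nor the data block numbers, which stat does not report.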
Inodes are stored in tables created by mke2fs in reserved areas of the file system. A symbolic link whose target name is shorter than 60 characters has that name stored directly in the inode, in the space otherwise occupied by the 15 block numbers of 4 bytes each (these are called fast symbolic links). In all other cases, the name of the file referred to by the symlink is stored in a data block belonging to the inode.
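The 60-character limit follows directly from the size of the block number area. A small sketch of the rule as described above; the constant and function names are hypothetical, and the check only mirrors the article's description rather than the kernel's code.

```python
import os

BLOCK_POINTERS = 15   # block numbers held in an inode
POINTER_SIZE = 4      # bytes per block number
INLINE_AREA = BLOCK_POINTERS * POINTER_SIZE  # 60 bytes


def is_fast_symlink(target):
    """True if the link target is short enough to live inside the inode itself."""
    return len(os.fsencode(target)) < INLINE_AREA


if __name__ == "__main__":
    print(INLINE_AREA, is_fast_symlink("/etc/hosts"))
```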
Finally, data blocks store the actual data. They span several sectors of 512 bytes (the smallest addressable unit on hard disks). Ext3 uses block sizes of 1024, 2048 or 4096 bytes – the required size is chosen when the file system is set up with mke2fs, the formatting program for Ext3. In theory, Ext3 supports block sizes up to 64 KB, but on x86 and x64 architectures, 4 KB is the maximum: this block size corresponds to that of the kernel's memory pages in RAM, which makes paging easier for the operating system.
Large blocks simplify data administration and allow the creation of larger file systems: Ext3 uses 32-bit values to assign block numbers, which means that it can only address about four billion blocks – 4 TB at a block size of 1024 bytes, 16 TB at 4096 bytes. In addition, file system administration requires a larger portion of the available hard disk space when blocks are small.
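The size limits quoted above are simple arithmetic: 2^32 block numbers multiplied by the block size. A brief sketch, in which the function name is chosen for illustration only:

```python
TIB = 2 ** 40  # one tebibyte in bytes


def max_fs_size_bytes(block_size, block_number_bits=32):
    """Largest addressable file system for a given block size."""
    return (2 ** block_number_bits) * block_size


if __name__ == "__main__":
    for block_size in (1024, 2048, 4096):
        print(block_size, max_fs_size_bytes(block_size) // TIB, "TB")
```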
On the other hand, large blocks can waste a lot of disk space because files always use a whole block even if they only contain a few bytes: on average, every file wastes half a block, and the larger the blocks and the smaller the files, the more noticeable the effect is. This effect is called internal fragmentation. Ext3 already contains data structures for managing several fragments within one data block, fragments in this case being residual file parts that don't fill a whole block. However, the feature hasn't been implemented, even though mke2fs already offers the -f parameter for it.
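To get a feeling for internal fragmentation, the wasted space can be estimated for a set of file sizes. The sketch below is a simplification: it counts only data blocks and ignores indirect blocks and other metadata, and the sample file sizes are invented.

```python
def allocated_bytes(file_size, block_size):
    """Space occupied by a file's data blocks (whole blocks only)."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def wasted_bytes(file_sizes, block_size):
    """Total unused space in the last block of each file."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


if __name__ == "__main__":
    sample = [100, 1500, 5000, 12000]  # hypothetical file sizes in bytes
    for block_size in (1024, 4096):
        print(block_size, wasted_bytes(sample, block_size))
```

For these four small files, 4 KB blocks waste roughly five times as much space as 1 KB blocks.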
When formatting, mke2fs matches the block size to the size of the file system: 1 KB for up to 512 MB, otherwise 4 KB. The mke2fs option -b also allows the block size to be set manually – this may make sense when a file system is to contain mostly very small files and the space wasted with larger block sizes is an issue.
Fast
Ext3's entire design is optimised for the typical work of applications: reading files with certain names or writing to them. For the file system, this means that it needs to quickly find the data belonging to a file name. The whole system revolves around the inodes, which are accessible via directory entries and contain metadata and pointers to the data blocks. The reverse mapping is not directly available: to find out which file a certain data block belongs to, all the inodes have to be searched for the requested block number, and then the directories for the respective inode number. The low-level debugfs tool does this with the icheck and ncheck commands (see addendum file system debugger).
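The asymmetry between the two directions can be illustrated with a toy model. This is not the on-disk format, only a sketch with invented names and numbers: a directory maps names to inode numbers, and each inode lists its data blocks.

```python
# Toy model: name -> inode number, inode number -> data block numbers
directory = {"notes.txt": 12, "copy-of-notes.txt": 12, "photo.jpg": 13}
inodes = {12: [100, 101], 13: [205, 206, 207]}


def blocks_of(name):
    """Forward direction: two direct lookups, name -> inode -> blocks."""
    return inodes[directory[name]]


def inode_of_block(block):
    """Reverse step 1 (comparable to icheck): scan every inode for the block."""
    for number, blocks in inodes.items():
        if block in blocks:
            return number
    return None


def names_of_inode(number):
    """Reverse step 2 (comparable to ncheck): scan the directory for the inode."""
    return sorted(name for name, n in directory.items() if n == number)


if __name__ == "__main__":
    inode = inode_of_block(101)
    print(inode, names_of_inode(inode))
```

The forward lookup touches exactly one directory entry and one inode, whereas both reverse steps have to walk through all entries in the worst case.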
Therefore, access to the inodes needs to be particularly efficient. The system guarantees this by writing the inodes into static tables on the disk during formatting. One consequence of this is that the number of inodes can't be altered after the file system has been set up. As every file needs to be assigned to one specific inode there can't be more files than inodes. By default, mke2fs creates one inode for every 4 KB in file systems up to 512 MB, otherwise one inode for every 8 KB.
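The defaults described in this article translate into a simple rule of thumb for the maximum number of files. The sketch below encodes only the figures given here; the actual defaults depend on the mke2fs version and on /etc/mke2fs.conf, and mke2fs rounds the inode count to fit its internal layout, so the results are approximations.

```python
MIB = 2 ** 20
GIB = 2 ** 30
SMALL_FS_LIMIT = 512 * MIB  # threshold named in the article


def default_block_size(fs_bytes):
    """Block size in bytes according to the article's rule."""
    return 1024 if fs_bytes <= SMALL_FS_LIMIT else 4096


def default_bytes_per_inode(fs_bytes):
    """One inode per 4 KB on small file systems, otherwise one per 8 KB."""
    return 4096 if fs_bytes <= SMALL_FS_LIMIT else 8192


def max_files(fs_bytes):
    """Approximate upper bound on the number of files (one inode each)."""
    return fs_bytes // default_bytes_per_inode(fs_bytes)


if __name__ == "__main__":
    for size in (256 * MIB, 10 * GIB):
        print(size, default_block_size(size), max_files(size))
```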
Those who believe they know better than mke2fs, for example because they only intend to store a few large files or a great many small ones, can pass the -i option to mke2fs to set their own value for how many data bytes correspond to each inode. If there are only a few inodes there can only be a few files, but this setup frees up several megabytes of usable disk space – by design, every inode uses 128 bytes on the hard disk. A large number of inodes allows users to create more files.
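How much space the inode tables take up for a given bytes-per-inode value can be estimated as follows. This is a rough sketch with illustrative function names; it ignores the rounding that mke2fs applies, and the 100 GB file system is an arbitrary example.

```python
INODE_SIZE = 128  # bytes per inode, as stated above
MIB = 2 ** 20
GIB = 2 ** 30


def inode_count(fs_bytes, bytes_per_inode):
    """Approximate number of inodes created for a given bytes-per-inode value."""
    return fs_bytes // bytes_per_inode


def inode_table_bytes(fs_bytes, bytes_per_inode, inode_size=INODE_SIZE):
    """Approximate disk space consumed by the inode tables."""
    return inode_count(fs_bytes, bytes_per_inode) * inode_size


if __name__ == "__main__":
    fs = 100 * GIB  # hypothetical file system size
    for ratio in (8192, MIB):
        print(ratio, inode_count(fs, ratio), inode_table_bytes(fs, ratio) / MIB, "MB")
```

On this example, moving from one inode per 8 KB to one inode per 1 MB shrinks the inode tables from about 1600 MB to 12.5 MB, at the cost of allowing only about 100,000 files.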
The mke2fs option -T type gives access to several settings predefined in /etc/mke2fs.conf: small (the default for file systems up to 512 MB) selects a block size of 1 KB and a ratio of one inode for every four blocks, news a block size of 4 KB at one inode per block, and largefile and largefile4 one inode for every 256 or 1024 blocks, respectively, at a block size of 4 KB.
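Expressed in bytes per inode, the four usage types listed above compare as follows. The table in this sketch simply restates the figures from the text; the contents of /etc/mke2fs.conf on a given system may differ.

```python
# Usage types as described in the text: block size in bytes, blocks per inode
USAGE_TYPES = {
    "small": {"block_size": 1024, "blocks_per_inode": 4},
    "news": {"block_size": 4096, "blocks_per_inode": 1},
    "largefile": {"block_size": 4096, "blocks_per_inode": 256},
    "largefile4": {"block_size": 4096, "blocks_per_inode": 1024},
}


def bytes_per_inode(usage_type):
    """Data bytes that correspond to one inode for the given usage type."""
    settings = USAGE_TYPES[usage_type]
    return settings["block_size"] * settings["blocks_per_inode"]


if __name__ == "__main__":
    for name in USAGE_TYPES:
        print(name, bytes_per_inode(name))
```

Both small and news end up at one inode per 4 KB of data; they differ only in block size, whereas largefile and largefile4 provide one inode per 1 MB and 4 MB respectively.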
The fixed inode size of 128 bytes also allows fast access to this central data structure. By specifying mke2fs -I inode size when setting up the file system, users can select a larger value, which must be divisible by 128 without remainder. Ext3 can use larger inodes to store extended attributes.