Trees or lists
The dir_index feature, sometimes also called htree, is particularly important for file system performance. Ext2 originally stored the file names within a directory as a linked list. While this is an elegantly simple data structure, it has the disadvantage that operations take longer and longer as the number of entries grows. In a file system comparison carried out five years ago we found that the performance of Ext3 plummets even when there are only a few thousand files within a directory.
Since then, like its file system competitors ReiserFS, XFS and JFS, Ext3 has learnt to manage directories in tree structures. If dir_index is set, directory operations become drastically faster. Performance only drops when directories are filled with hundreds of thousands of files. This is usually caused by a caching effect: the smaller the main memory, the earlier the performance loss occurs. It seems that the Linux kernel doesn't use unlimited memory for caching directory structures: in single user mode, no further increase in performance was noted when RAM was stepped up from 1 to 2 GB. In normal operation with a series of additional processes, however, we could still see a difference between 1 and 2 GB.
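The difference between the two directory layouts can be illustrated with a small sketch. It does not reproduce the on-disk format of Ext3; it only contrasts a front-to-back scan of a list with an indexed lookup, and a Python dict stands in for the tree. The class and function names (LinearDirectory, IndexedDirectory, worst_case_steps) are made up for this illustration.

```python
class LinearDirectory:
    """Directory as a plain list of (name, inode) entries, scanned front to back."""

    def __init__(self):
        self.entries = []

    def add(self, name, inode):
        self.entries.append((name, inode))

    def lookup(self, name):
        """Return (inode, number of entries inspected)."""
        for steps, (entry_name, inode) in enumerate(self.entries, start=1):
            if entry_name == name:
                return inode, steps
        return None, len(self.entries)


class IndexedDirectory:
    """Directory with an index keyed by file name (a dict stands in for the tree)."""

    def __init__(self):
        self.index = {}

    def add(self, name, inode):
        self.index[name] = inode

    def lookup(self, name):
        """Return (inode, number of index probes); this model counts one probe."""
        return self.index.get(name), 1


def worst_case_steps(directory_cls, count):
    """Fill a directory with `count` files and look up the last one added."""
    directory = directory_cls()
    for i in range(count):
        directory.add(f"file{i:06d}", i)
    inode, steps = directory.lookup(f"file{count - 1:06d}")
    assert inode == count - 1
    return steps


if __name__ == "__main__":
    for count in (100, 1000, 10000):
        print(count,
              worst_case_steps(LinearDirectory, count),
              worst_case_steps(IndexedDirectory, count))
```

In this model the cost of finding the last file in the list grows in step with the number of entries, while the indexed lookup stays constant. A real tree lookup is not constant, but it grows far more slowly than a linear scan.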
The dir_index option can also be enabled after creating the file system using the tune2fs -O dir_index command, although it will then only be applied to newly created directories. Once the dir_index feature has been enabled, the existing directories of an unmounted file system can be converted to dir_index using the e2fsck -fD command. For safety reasons, this should be followed by another forced e2fsck run with the -f option.
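The conversion sequence can be written down as a small dry-run helper. This is only a sketch: it builds the three command lines named in the text and prints them without executing anything. The function name dir_index_commands and the device path /dev/sdXN are placeholders, not taken from the article. Actually running the commands requires root privileges and an unmounted file system.

```python
def dir_index_commands(device):
    """Return the command lines for converting `device` to dir_index, in order."""
    return [
        ["tune2fs", "-O", "dir_index", device],  # enable the feature
        ["e2fsck", "-fD", device],               # rebuild existing directories
        ["e2fsck", "-f", device],                # second forced check, for safety
    ]


if __name__ == "__main__":
    # "/dev/sdXN" is a placeholder; substitute the real (unmounted) partition.
    for argv in dir_index_commands("/dev/sdXN"):
        print(" ".join(argv))
```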
The -D option of e2fsck is also useful when files are continually created and deleted within directories: since Ext3 doesn't remove the names of deleted files from the directory files, the latter keep growing, even if most entries are no longer in use. e2fsck -fD removes invalid entries from the directory files and recreates the file name tree, which can considerably increase the speed of directory operations in large directories.
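The effect can be modelled in a few lines. The sketch below follows the description above in a deliberately simplified form: deleting a file only marks its slot as unused, new files are always appended, and a compact step plays the role of the directory rebuild. The names DirectoryFile and churn are hypothetical, and the model says nothing about the actual Ext3 directory format.

```python
class DirectoryFile:
    """Toy directory file: a list of slots holding a file name or None."""

    def __init__(self):
        self.slots = []

    def create(self, name):
        # Simplification: new entries are always appended, slots are never reused.
        self.slots.append(name)

    def delete(self, name):
        # The entry is only marked as unused; the directory file does not shrink.
        self.slots[self.slots.index(name)] = None

    def size(self):
        return len(self.slots)

    def compact(self):
        # Stand-in for the rebuild: drop every unused slot.
        self.slots = [name for name in self.slots if name is not None]


def churn(created, deleted):
    """Create `created` files, delete the first `deleted`, then compact.

    Returns (size before compaction, size after compaction).
    """
    directory = DirectoryFile()
    for i in range(created):
        directory.create(f"tmp{i}")
    for i in range(deleted):
        directory.delete(f"tmp{i}")
    before = directory.size()
    directory.compact()
    return before, directory.size()


if __name__ == "__main__":
    print(churn(1000, 990))
```

In this model a directory that once held 1000 files still has 1000 slots after 990 deletions. Only the compaction step brings it back down to the 10 entries that are in use.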
Journalling
The main difference between Ext2 and Ext3 is the journal we previously mentioned. The idea behind it is simple: A change within the file system, such as when a new file is created, has effects in many places - a new directory entry and a new inode are created, data blocks and inode are tagged as reserved in the block and inode bitmap, the last access time changes in the directory inode, the file system statistics in the superblock are updated and the data itself is written. If there is a power loss between the various write operations in the different data structures, or if the system crashes, the file system becomes inconsistent - there may, for example, be an inode without a corresponding directory entry, resulting in an unnamed file in lost+found after an e2fsck run.
To prevent this, Ext3 initially writes the changes to its journal. Until all the relevant changes (a transaction) have been entered into the journal, the file system's (old) metadata remains intact. Once the transaction is complete, the (new) metadata is consistent in the journal and can be transferred to the file system at the next opportunity. If there is a crash, e2fsck simply needs to retransfer the completed transactions in the journal to the file system to restore consistency. Incomplete transactions in the journal are ignored because in this case the (old) data is still valid in the file system.
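The replay rule can be sketched as a toy model. It is not the Ext3 journal format: dictionaries stand in for the metadata on disk and for the transactions, and every name (JournaledStore, simulate_crash, the keys such as "dirent:a.txt") is invented for the illustration.

```python
class JournaledStore:
    """Toy model: metadata changes go to a journal first, then to the 'disk'."""

    def __init__(self):
        self.disk = {}      # metadata as stored in the file system
        self.journal = []   # transactions in the order they were started

    def begin(self):
        txn = {"writes": {}, "committed": False}
        self.journal.append(txn)
        return txn

    def log(self, txn, key, value):
        # Record a change in the journal only; the disk is left untouched.
        txn["writes"][key] = value

    def commit(self, txn):
        # Mark the transaction as completely written to the journal.
        txn["committed"] = True

    def replay(self):
        # Transfer completed transactions to the disk, ignore incomplete ones.
        for txn in self.journal:
            if txn["committed"]:
                self.disk.update(txn["writes"])
        self.journal.clear()


def simulate_crash():
    """One complete and one interrupted transaction, followed by a replay."""
    store = JournaledStore()

    first = store.begin()
    store.log(first, "dirent:a.txt", 12)
    store.log(first, "inode_bitmap:12", "used")
    store.commit(first)

    second = store.begin()
    store.log(second, "dirent:b.txt", 13)
    # The crash happens here, before the second transaction is committed.

    store.replay()
    return store.disk


if __name__ == "__main__":
    print(simulate_crash())
```

After the replay only the entries of the first transaction are on the disk. The half-written second transaction leaves no trace, so the old state remains valid for everything it would have touched.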
Ext3 has various journal operating modes which can be selected via mount -o data=MODE when mounting. The default mode is ordered: Ext3 writes the data to disk before the altered metadata is entered into the journal, but this only ensures the consistency of the metadata. If the computer crashes or there is a power cut before the transaction in the journal is complete, data which has already been written is lost, since the newly allocated blocks have not been assigned to their inode and haven't been tagged as allocated in the block bitmap.
The option data=journal causes the data itself to pass through the journal, but this drastically decreases file system performance, since the entire data set needs to be written to disk twice. With the writeback option, the metadata in the journal can be written before the actual data. This can increase performance slightly since Ext3 can optimise write accesses better this way; if there is a crash, however, old data may appear in apparently newly created files, and files which were supposedly created correctly may be empty.
To increase performance, the journal can be stored on a different disk than the file system; this allows simultaneous access to both file system and journal. To do this, the external journal first needs to be created with mke2fs -O journal_dev DEVICE, and then mke2fs -J device=DEVICE has to be called when creating the file system.
There is another difference between Ext2 and Ext3: the two file systems delete files in different ways. While Ext2 only enters the deletion time into the inode (and marks data blocks and inode as available in the block and inode bitmap), Ext3 also deletes the block numbers in the inode. This makes returning the file system to a consistent state easier after a crash, but it also means that Ext2 programs for recovering deleted files, such as the lsdel command in debugfs and special undelete tools, do not work with Ext3.