Big files
The most important under-the-bonnet improvement in Ext4 is the use of extents in place of the indirect block addressing used in Ext3. In Ext3, an inode directly stores the numbers of up to twelve 4 KB blocks. If a file is larger than 48 KB, singly indirect blocks are used first (adding up to 4 MB), then doubly indirect blocks (up to 4 GB) and lastly triply indirect blocks, in which a block number in the inode points to a block containing block numbers, which point to blocks containing block numbers, which in turn point to blocks containing the block numbers of the data (details on Ext3 can be found here). This classic UNIX addressing scheme has certainly proved its worth for small, highly fragmented or sparse files, but carries an increasing administrative overhead when dealing with large files.
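The size figures quoted above follow from simple arithmetic. The following is a minimal sketch, assuming 4 KB blocks and 4-byte block numbers (so that one indirect block holds 1,024 entries); the function and constant names are made up for illustration and are not taken from the Ext3 sources.

```c
#include <stdint.h>

/* Assumptions: 4 KB blocks and 32-bit (4-byte) block numbers, so one
 * indirect block holds 4096 / 4 = 1024 block numbers. */
enum { BLOCK_BYTES = 4096, PTRS_PER_BLOCK = BLOCK_BYTES / 4, DIRECT_BLOCKS = 12 };

/* Bytes reachable through ONE inode pointer at the given level:
 * 0 = direct, 1 = single, 2 = double, 3 = triple indirect. */
uint64_t bytes_per_pointer(int level)
{
    uint64_t bytes = BLOCK_BYTES;
    for (int i = 0; i < level; i++)
        bytes *= PTRS_PER_BLOCK;
    return bytes;
}

/* Indirection level needed to reach the given byte offset in a file,
 * i.e. how many blocks of block numbers must be read before the data.
 * Returns -1 if the offset lies beyond the triple indirect range. */
int indirection_level(uint64_t offset)
{
    uint64_t limit = (uint64_t)DIRECT_BLOCKS * BLOCK_BYTES; /* 48 KB */
    if (offset < limit)
        return 0;
    for (int level = 1; level <= 3; level++) {
        limit += bytes_per_pointer(level);
        if (offset < limit)
            return level;
    }
    return -1;
}
```

Under these assumptions, a read just past the first 48 KB costs one extra block lookup, and a read beyond roughly 4 MB costs two, which is the overhead that grows with file size.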
Extent data structure
struct ext3_extent {
    __u32 ee_block;    /* first logical block extent covers */
    __u16 ee_len;      /* number of blocks covered by extent */
    __u16 ee_start_hi; /* high 16 bits of physical block */
    __u32 ee_start;    /* low 32 bits of physical block */
};
Extents: a 12-byte data structure managing up to 128 MB of data.
Rather than addressing individual blocks, extents map a portion of a file (which should be as large as possible) to a range of contiguous blocks on the hard drive. Consequently, instead of lots of individual block numbers, just three values are required: the start and size of the portion within the file (both in file system blocks) and the number of the first data block on the hard drive. The data structure of an extent in Ext4 is shown in the box above.
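To make the mapping concrete, here is a small userspace sketch of how a logical block number within a file could be translated into a physical block number using the three values of one extent. The struct mirrors the fields of the extent structure shown above, but `struct extent`, `extent_start()` and `extent_map()` are hypothetical names chosen for this example, and the length field is treated as a plain block count.

```c
#include <stdint.h>

/* Userspace mirror of the on-disk extent fields; illustrative only. */
struct extent {
    uint32_t ee_block;    /* first logical block extent covers */
    uint16_t ee_len;      /* number of blocks covered by extent */
    uint16_t ee_start_hi; /* high 16 bits of physical block */
    uint32_t ee_start;    /* low 32 bits of physical block */
};

/* Join the two halves into the 48-bit physical start block. */
uint64_t extent_start(const struct extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/* Translate a logical block of the file into a physical block.
 * Returns 0 and stores the result in *pblk if the extent covers lblk,
 * or -1 if lblk lies outside the extent. */
int extent_map(const struct extent *ex, uint32_t lblk, uint64_t *pblk)
{
    if (lblk < ex->ee_block || lblk - ex->ee_block >= ex->ee_len)
        return -1;
    *pblk = extent_start(ex) + (lblk - ex->ee_block);
    return 0;
}
```

Because the blocks of an extent are contiguous, the translation is a single addition; no further block of block numbers has to be read.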
Ext4 uses 32 bits to record the number of blocks within a file, which limits the maximum file size in Ext4 file systems to 2^32 4 KB blocks, equivalent to 16 TB. This limit may be removed when the extent format is revised – the developers are already considering using a different format, which would go back to using individual block numbers, for sparse and very highly fragmented files which can't be managed efficiently using extents. The developers are also thinking about being able to detect damaged extents using additional information and a checksum.
This may all end up being pie in the sky. Nonetheless, the foundation for all these developments has already been laid: a header structure on the drive at the start of the extents includes a magic number for identification purposes, which allows differentiation between different extent types should this become necessary.
15 bits are available for the size value of an extent, so that an extent cannot be larger than 2^15 4 KB blocks, equivalent to 128 MB. There is a simple reason for this limit – like Ext3, Ext4 divides the hard drive into 128 MB block groups. Because each block group is preceded by a block group descriptor and an extract from the inode bitmap, the block bitmap and the inode table, it is not possible to store more than 128 MB in one go.
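Both limits can be checked with a couple of lines of arithmetic. The sketch below assumes 4 KB blocks throughout; the function names are invented for this example.

```c
#include <stdint.h>

enum { FS_BLOCK_BYTES = 4096 }; /* assumption: 4 KB file system blocks */

/* 32-bit logical block numbers: 2^32 blocks of 4 KB = 2^44 bytes (16 TB). */
uint64_t max_file_bytes(void)
{
    return (UINT64_C(1) << 32) * FS_BLOCK_BYTES;
}

/* 15-bit extent length: 2^15 blocks of 4 KB = 2^27 bytes (128 MB). */
uint64_t max_extent_bytes(void)
{
    return (UINT64_C(1) << 15) * FS_BLOCK_BYTES;
}
```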
The remaining, sixteenth, extent size bit records whether or not data has already been written to the extent – if not, the file system just returns null values when an attempt is made to read the data (uninitialised extent flag). This allows applications to preallocate drive space ("persistent preallocation" – on which more below), one of many performance enhancement measures in Ext4, whilst ensuring that this does not allow access to data previously stored in the allocated area.
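The split of the 16-bit length field can be sketched as follows. This follows the simplified description given here (low 15 bits for the block count, top bit as the "uninitialised" flag); the macro and function names are hypothetical, and the real kernel code may encode boundary cases differently.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the 16-bit length field, per the description above. */
#define EXTENT_LEN_MASK    0x7FFFu /* low 15 bits: number of blocks */
#define EXTENT_UNINIT_FLAG 0x8000u /* top bit: preallocated, not yet written */

unsigned extent_block_count(uint16_t ee_len)
{
    return ee_len & EXTENT_LEN_MASK;
}

bool extent_is_uninitialised(uint16_t ee_len)
{
    return (ee_len & EXTENT_UNINIT_FLAG) != 0;
}

/* Model of a read: an uninitialised extent yields zeros rather than
 * whatever stale data happens to be in the allocated blocks. */
void extent_read(uint16_t ee_len, const unsigned char *on_disk,
                 unsigned char *out, size_t nbytes)
{
    if (extent_is_uninitialised(ee_len))
        memset(out, 0, nbytes);
    else
        memcpy(out, on_disk, nbytes);
}
```

The point of the flag is visible in `extent_read()`: the old contents of the preallocated blocks never reach the caller.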
Extents offer benefits for large files in particular and especially for operations which require primarily metadata operations, such as deleting and truncating large files. This is immediately apparent when you create and then delete a few large files using dd (see table). Deletion in particular requires just a fraction of the time under Ext4.
Performance with large files
| Test | Ext3¹ | Ext4¹ | Improvement |
| Creation of eight 1 GB files: time | 155.9 s | 145.1 s | 6.9 % |
| Creation of eight 1 GB files: write speed | 55.4 MB/s | 59.3 MB/s | 7.0 % |
| Deletion of eight 1 GB files: time | 11.87 s | 0.33 s | 97.2 % |
| 10,000 random read and write operations in 8 GB: operations/s | 80.0 | 88.7 | 10.9 % |
¹ Mount option: noatime; single user mode; in each case file system newly mounted
Extent trees
Ext4 uses the 60 bytes in the inode which Ext3 uses to store 15 32-bit block numbers to store one extent header and four extents, each 12 bytes in size. This allows files of up to 512 MB to be managed directly from the inode. This also illustrates another, very practical advantage of 48-bit block numbers – if 64 bits were used for both the position of the extent within the file and the start block, the size of an extent would increase to 18 bytes. Since the extent header occupies 12 bytes, this would allow just two, as opposed to four, extents to be stored within the inode.
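The space calculation can be written down in a few lines. This is only a sketch of the arithmetic in the paragraph above; the names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

enum { INODE_AREA_BYTES = 60, HEADER_BYTES = 12 };

/* How many extents of the given size fit behind the 12-byte header. */
size_t extents_in_inode(size_t extent_bytes)
{
    return (INODE_AREA_BYTES - HEADER_BYTES) / extent_bytes;
}

/* Largest file that can be mapped from the inode alone, given the
 * maximum number of bytes a single extent can cover. */
uint64_t inode_direct_bytes(size_t extent_bytes, uint64_t per_extent_bytes)
{
    return extents_in_inode(extent_bytes) * per_extent_bytes;
}
```

With 12-byte extents, `extents_in_inode(12)` gives four slots and 4 × 128 MB = 512 MB; with hypothetical 18-byte extents, only two slots would remain.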
If a file is larger than 512 MB, Ext4 builds an extent tree. An additional data structure is used in this case: the extent index, which contains just the start position of the extent within the file and a block number on the hard drive. The block it points to can in turn contain either extents pointing to data or more extent indices, with each block beginning with an extent header. The extent tree starts with an extent index in the inode.
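One step of a walk down such a tree might look like the sketch below. The two structs are userspace stand-ins whose field names are modelled on the kernel's extent header and index as best recalled, so treat the exact layout as an assumption; `index_find()` and `index_child_block()` are hypothetical helpers. The index entries in a block are assumed to be sorted by their starting logical block.

```c
#include <stddef.h>
#include <stdint.h>

/* 12-byte header at the start of every extent block (assumed layout). */
struct ext_header {
    uint16_t eh_magic;      /* magic number identifying the format */
    uint16_t eh_entries;    /* number of valid entries that follow */
    uint16_t eh_max;        /* capacity of this block in entries */
    uint16_t eh_depth;      /* 0 = entries are extents, >0 = indices */
    uint32_t eh_generation;
};

/* 12-byte index entry: where a subtree starts in the file, and which
 * block on the drive holds that subtree (assumed layout). */
struct ext_index {
    uint32_t ei_block;   /* first logical block covered by the subtree */
    uint32_t ei_leaf_lo; /* low 32 bits of the child block number */
    uint16_t ei_leaf_hi; /* high 16 bits of the child block number */
    uint16_t ei_unused;
};

/* Binary search: position of the last index entry whose ei_block is
 * <= lblk, or -1 if lblk lies before the first entry. */
int index_find(const struct ext_index *idx, size_t n, uint32_t lblk)
{
    size_t lo = 0, hi = n; /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].ei_block <= lblk)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)lo - 1;
}

/* 48-bit number of the block holding the next level of the tree. */
uint64_t index_child_block(const struct ext_index *ix)
{
    return ((uint64_t)ix->ei_leaf_hi << 32) | ix->ei_leaf_lo;
}
```

A full lookup would repeat this step, reading the child block and checking the depth in its header, until a block of extents is reached.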
Extents deal with two problems present in Ext3. They reduce the management overhead for large files – it is more efficient to manage a 500 MB file using four 12-byte extents in the inode than half a megabyte of 32-bit block numbers spread across the hard drive – and can prevent fragmentation of the file system. To achieve this, the developers have implemented a number of new mechanisms.
One of these is "persistent preallocation", discussed briefly above. The fallocate() call allows an application to reserve a specified amount of space for a file and thus tell the file system how large the file is going to be. This is particularly useful where a file grows slowly or is not written sequentially, as, for example, happens with some file-sharing applications.
Thanks to persistent preallocation, Ext4 can reserve sufficient space on contiguous (or as near contiguous as possible) parts of the disk in advance. A fortuitous by-product for users is that actually writing the data to the disk can no longer fail due to lack of space. fallocate() is not yet available in Glibc; applications have to call the function using either syscall() or posix_fallocate(). The sixteenth extent size bit (discussed above) records whether an extent has been preallocated but not loaded with data.
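As an illustration, an application could reserve space along the following lines using posix_fallocate(), one of the two routes mentioned above. This is a minimal sketch: `reserve_space()` is a hypothetical helper, and whether the call ends up using the file system's native preallocation or falls back to some other method depends on the C library, kernel and file system in use.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve `bytes` of space for the file at `path`, creating the file
 * if it does not exist. Returns 0 on success or an errno-style code. */
int reserve_space(const char *path, off_t bytes)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return errno;

    /* posix_fallocate() returns the error number directly and does
     * not set errno. */
    int err = posix_fallocate(fd, 0, bytes);

    if (close(fd) != 0 && err == 0)
        err = errno;
    return err;
}
```

A download tool, for example, might call `reserve_space("download.part", expected_size)` before fetching the first chunk, so that a full disk is detected up front rather than halfway through the transfer.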