Ext4 data loss; explanations and workarounds
Last week's announcement of possible data losses with Ext4 caused consternation and heated discussion. One user of an alpha version of Ubuntu 9.04 with Ext4 lost a large amount of data on a system crash immediately after starting KDE 4. Following a reboot, almost all the files written just before the crash showed a size of 0 bytes and were empty. He complained that nothing like that had ever happened to him with Ext3.
What had happened? When applications want to overwrite an existing file with new or changed data (for example, a configuration file after the user has changed a setting), they frequently first create a temporary file for the new data and then rename it over the old file with the rename() system call. The logic behind this is that, if something goes wrong during the write process, say the computer crashes or there's a power failure, at least the old version of the file will be retained.
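The pattern can be sketched in a few lines of Python. This is only an illustration of the general technique, not code from any of the affected applications: the helper name replace_via_rename and the ".tmp" suffix are assumptions made for the example. Note that nothing here forces the data to disk.

```python
import os

def replace_via_rename(path, data):
    """Overwrite `path` by writing a temporary file and renaming it.

    Hypothetical helper for illustration only. No fsync() is issued,
    so the new data may still be in the cache when the rename happens.
    """
    tmp_path = path + ".tmp"  # assumed naming scheme for the temporary file
    with open(tmp_path, "wb") as f:
        f.write(data)
    # os.replace() uses rename() on POSIX systems: the existing name is
    # switched over to the new file in a single step.
    os.replace(tmp_path, path)
```

A call such as replace_via_rename("settings.conf", b"theme=dark\n") leaves either the old or the new file under that name as long as the system keeps running; what survives a crash depends on the file system, as described below.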
The process involves two things. First, metadata in the file system is changed. An inode is created for the new file that references the data, and a new index entry (directory entry) is generated that points to the new inode. After a rename(), the index entry of the old file is changed so that it points to the new inode. Second, the data itself is written. To do this, the file system must first allocate a sufficient number of data blocks on the disk and then write the data to those blocks.
Both Ext3 and Ext4 first write all metadata changes to their journals, so even after the rename() nothing has actually been changed in the file system itself. If the power fails at this point, the new file doesn't yet exist in the file system: the index entry of the old file still points to the old inode and thus to the old data, and the changed metadata in the journal is not yet valid. A "commit" of the changes is required in the journal to make those changes valid. The changed metadata is only transferred from the journal to the file system proper some time after that (or on the next reboot after a crash).
But there's a crucial difference here between Ext3 and Ext4. Ext3 (with the standard mount option "data=ordered") only commits the metadata in the journal once the data of the new file has actually been written to the disk. This can take up to five seconds, during which time the data are still temporarily stored in the cache. The ordering is meant to prevent old data turning up in a newly created file following a system crash: the allocated data blocks may previously have been occupied by a file that has meanwhile been deleted, and may not yet have been overwritten with the new data. So, after a system crash, the file contains either the old or the new data, depending on whether the crash took place before or after the commit.
Ext4, on the other hand, has another mechanism: delayed block allocation. After a file has been closed, up to a minute may elapse before data blocks on the disk are actually allocated. Delayed block allocation allows the file system to optimise its write processes, but at the price that the metadata of a newly created file will display a size of 0 bytes and occupy no data blocks until the delayed allocation takes place. If the system crashes during this time, the rename() operation may already be committed in the journal, even though the new file still contains no data. The result is that after a crash the file is empty: both the old and the new data have been lost.
Ext4 developer Ted Ts'o stresses in his answer to the bug report that Ext4 behaves precisely as demanded by the POSIX standard for file operations. Other file systems, such as XFS, display the same behaviour: the "safer" behaviour of Ext3 is only an unintended side effect. For Ts'o, the problem lies with application developers who take the forgiving behaviour of Ext3 to be a standard. He advises that, if an application wants to ensure that data have actually been written to disk, it must call the function fsync() before closing the file.
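As a rough illustration of that advice, the write-then-rename pattern with an fsync() before the file is closed might look like the following Python sketch. The helper name replace_durably and the ".tmp" suffix are assumptions made for the example, not anything taken from the bug report.

```python
import os

def replace_durably(path, data):
    """Write `data` to a temporary file, fsync it, then rename it over `path`.

    Hypothetical helper for illustration only.
    """
    tmp_path = path + ".tmp"  # assumed naming scheme for the temporary file
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # hand Python's own buffer over to the kernel
        os.fsync(f.fileno())  # ask the kernel to write the file's data to disk
    # Only now is the old name switched over to the new file.
    os.replace(tmp_path, path)
```

Because the data has been flushed before the rename is issued, the renamed file should not turn up empty after a crash. The price is that fsync() blocks until the write has completed, which is slower than leaving the data in the cache.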
Nevertheless, as a workaround, he very quickly wrote patches for Ext4 that recognise both the rename() situation and a second overwrite procedure that uses ftruncate(), and ensure that Ext4 behaves like Ext3 in those cases. Now, however, he has provided a "proper" solution. The new Ext4 mount option "alloc_on_commit" gives Ext4 a mechanism analogous to "data=ordered" in Ext3, whereby metadata is not committed in the journal until after blocks have been allocated and the data has been written. However, this change probably won't arrive until version 2.6.30 of the kernel at the earliest.
(odi/heise open)
(djwm)