Kernel Log - Coming in 3.8 (Part 2)
Infrastructure
by Thorsten Leemhuis
Users without root privileges will in future be able to set up containers in which to run software that requires those privileges. Linux 3.8 will, under certain circumstances, use less memory and will include improvements for NUMA systems. Support for Intel 386 chips has been dropped.
Linux 3.8 more or less completes the major restructuring around user namespaces, driven forward over the last few months by Eric W. Biederman. These changes will enable non-privileged users to create an isolated space within which they will have root privileges. Within this space, however, users will not be able to do anything which would require privileged actions to be taken outside of the namespace.
The changes are of interest for areas such as container virtualisation, allowing users to run software which requires root privileges without endangering the rest of the system. The work carried out means that it will even be possible to nest this capability, enabling users to set up a further isolated space within a container; this could prove useful in applications such as virtual servers provisioned by hosting companies using container technologies. Details of user namespaces and other namespace technologies can be found in a series of articles on LWN.net, of which there are currently four parts (1, 2, 3, 4). This restructuring work is not completely finished, which means user namespaces can only be activated during the configuration of Linux 3.8 if various filesystems (including Ceph, CIFS, GFS2, NFS, OCFS2 and XFS) are omitted.
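How "root inside, unprivileged outside" works can be illustrated with the ID mapping table that the kernel exposes for a process in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it and the length of the range. The following is a minimal sketch of that translation only; the function names and the example mapping are hypothetical and it does not create a namespace itself.

```python
from typing import List, Optional, Tuple

# One mapping entry: (first ID inside the namespace,
#                     first ID outside the namespace, number of IDs).
UidMapEntry = Tuple[int, int, int]


def parse_uid_map(text: str) -> List[UidMapEntry]:
    """Parse text in the three-column format of /proc/<pid>/uid_map."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        inside, outside, count = (int(field) for field in fields)
        entries.append((inside, outside, count))
    return entries


def to_outside_uid(entries: List[UidMapEntry], inside_uid: int) -> Optional[int]:
    """Translate a UID seen inside the namespace to the UID outside it.

    Returns None when the UID has no mapping.
    """
    for inside, outside, count in entries:
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    return None


# Hypothetical mapping: UID 0 (root) inside the namespace is the
# unprivileged user 1000 outside it.
example_map = parse_uid_map("0 1000 1\n")
print(to_outside_uid(example_map, 0))  # root inside maps to 1000 outside
print(to_outside_uid(example_map, 1))  # no mapping, so None
```

With a mapping like this, anything the process does as UID 0 is checked against the rights of UID 1000 once it touches resources outside the namespace.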
Resource controls
The kmem extension has been added to the memory cgroup controller (1, 2 and others, documentation). It can be used to limit the amount of memory required by the kernel for managing processes. Using appropriately configured limits, this should enable the kernel to keep the lid on fork bombs even within containers.
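A kernel memory limit of this kind is set by writing a byte count to the file memory.kmem.limit_in_bytes in the directory of a memory cgroup. The sketch below shows the idea with two hypothetical helper functions; it uses a temporary directory as a stand-in so that it runs without root privileges. On a real system the directory would be a group below the memory controller's mount point, whose location depends on the distribution.

```python
import os
import tempfile

KMEM_LIMIT_FILE = "memory.kmem.limit_in_bytes"


def set_kmem_limit(cgroup_dir: str, limit_bytes: int) -> None:
    """Write a kernel memory limit into a memory cgroup directory."""
    if limit_bytes <= 0:
        raise ValueError("limit must be a positive number of bytes")
    with open(os.path.join(cgroup_dir, KMEM_LIMIT_FILE), "w") as f:
        f.write(str(limit_bytes))


def read_kmem_limit(cgroup_dir: str) -> int:
    """Read the kernel memory limit back as an integer."""
    with open(os.path.join(cgroup_dir, KMEM_LIMIT_FILE)) as f:
        return int(f.read().strip())


# Stand-in directory (an assumption for this demo); a real cgroup
# directory already contains the control file.
with tempfile.TemporaryDirectory() as fake_cgroup:
    set_kmem_limit(fake_cgroup, 64 * 1024 * 1024)  # 64 MiB
    print(read_kmem_limit(fake_cgroup))
```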
Following a lengthy, sometimes heated dispute between various kernel developers, a number of enhancements developed under the "balancenuma" umbrella have been merged into Linux 3.8 (1, 2, 3, 4, 5 and others). In the ideal case, these improvements will enable the kernel to automatically keep a process and the memory it uses on the same processor. This is important for optimising performance on multiprocessor systems which use the now widespread NUMA (non-uniform memory access) architecture, as processes on NUMA systems can access memory allocated to the processor on which they are running faster than they can access memory allocated to other processors. Linux 3.9 and later versions will include further improvements to automated placement and will build on the foundations that have now been merged.
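To see which processor cores count as "local" to which memory, the kernel publishes a CPU list per NUMA node in /sys/devices/system/node/node<N>/cpulist, written in a compact form such as "0-3,8-11". The following sketch parses such lists and looks up the node of a given CPU. The helper names and the two-node layout are hypothetical, and the code only inspects topology; it is not part of the placement logic merged into the kernel.

```python
from typing import Dict, List, Optional


def parse_cpulist(text: str) -> List[int]:
    """Expand a kernel CPU list such as "0-3,8-11" into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list describes a node without CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def node_of_cpu(node_cpulists: Dict[int, str], cpu: int) -> Optional[int]:
    """Return the node whose CPU list contains the CPU, or None."""
    for node, cpulist in sorted(node_cpulists.items()):
        if cpu in parse_cpulist(cpulist):
            return node
    return None


# Hypothetical two-node machine; real values would be read from sysfs.
layout = {0: "0-3,8-11", 1: "4-7,12-15"}
print(node_of_cpu(layout, 9))  # CPU 9 is listed under node 0
print(node_of_cpu(layout, 5))  # CPU 5 is listed under node 1
```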
For kernels which support transparent huge pages, Linux 3.8 can reserve a huge zero page to serve as the basis for new huge pages using COW (copy-on-write) (1, 2, 3, 4, 5 and others). This can significantly reduce memory use in cases where such memory areas are requested but never written to.
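The saving can be estimated with some simple arithmetic. The toy model below assumes 2 MiB huge pages, the usual size on x86-64, and assumes that every region which is read but never written would otherwise have been backed by a huge page of its own. The function name and the numbers are illustrative only and do not come from the kernel patches.

```python
HUGE_PAGE_BYTES = 2 * 1024 * 1024  # 2 MiB, assumed huge page size


def backing_bytes(read_only_regions: int, written_regions: int,
                  huge_zero_page: bool) -> int:
    """Estimate the memory backing a set of huge-page-sized regions.

    Written regions always need a private huge page. Regions that are
    only read either get a private page each or, with the huge zero
    page, all share a single page of zeros until the first write.
    """
    written = written_regions * HUGE_PAGE_BYTES
    if huge_zero_page:
        shared = HUGE_PAGE_BYTES if read_only_regions > 0 else 0
        return written + shared
    return written + read_only_regions * HUGE_PAGE_BYTES


# 100 regions only read, 5 regions written to.
print(backing_bytes(100, 5, huge_zero_page=False) // (1024 * 1024))  # 210 MiB
print(backing_bytes(100, 5, huge_zero_page=True) // (1024 * 1024))   # 12 MiB
```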
Development Status
Linus Torvalds released the sixth release candidate of Linux 3.8 at the beginning of February. In doing so, he appealed to his colleagues to submit more patches for meaningful bugs. Torvalds explained that he is working, as is usually the case, to make the seventh RC the last one in this cycle. However, he did reserve the right to move to an RC8 if needed. Linux 3.8 RC7 was released on 9 February without any indication of whether it will be the last release candidate in the series.
Since new release candidates are released approximately once a week, Linux 3.8 can be expected by the middle of February or the week after, provided Torvalds does not schedule another RC.
CPU and platform code
On more recent Intel processors, the diagnostic tool turbostat, which is included in the 3.8 kernel, can report approximately how many watts are being consumed by individual graphics and processor cores and by the processor as a whole.
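Power figures of this kind are typically derived from an energy counter that is sampled twice: the difference in energy divided by the elapsed time gives the average power. The sketch below shows that calculation under the assumption that the hardware provides a 32-bit counter which wraps around and counts in fixed fractions of a joule, as Intel's RAPL interface does. The function name and sample values are hypothetical, and the code is not taken from turbostat.

```python
def average_watts(energy_start: int, energy_end: int, seconds: float,
                  joules_per_unit: float, counter_bits: int = 32) -> float:
    """Average power between two raw energy counter samples.

    The modulo handles a single wrap-around of the counter between
    the two samples.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    wrap = 1 << counter_bits
    delta_units = (energy_end - energy_start) % wrap
    return delta_units * joules_per_unit / seconds


# Hypothetical samples: the counter advanced by 655360 units of
# 1/65536 joule within one second.
print(average_watts(1000, 1000 + 655360, 1.0, 2 ** -16))  # 10.0
```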
A balloon driver for Hyper-V has been merged into the kernel. The Microsoft hypervisor can, at runtime, temporarily transfer memory from Linux guests running this driver to the host and subsequently return it. The ability to enlarge the amount of memory above the value specified when the VM is launched is, however, still not supported.
The new UEFI variable filesystem (efivarfs) is able to read and set EFI variables with sizes larger than 1 KB. Version 18 of Fedora already uses this filesystem, usually mounted at /sys/firmware/efi/efivars/, to exchange data with the MOK support in the secure boot loader shim.
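In the filesystem mounted at /sys/firmware/efi/efivars/, each variable appears as one file named after the variable and its vendor GUID, and the file content starts with a 32-bit little-endian attribute field followed by the variable's data. The following sketch decodes that layout from raw bytes; the helper names, the variable name and the GUID in the example are hypothetical.

```python
import struct
from typing import Tuple

# Attribute bits defined by the UEFI specification.
EFI_VARIABLE_NON_VOLATILE = 0x1
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x2
EFI_VARIABLE_RUNTIME_ACCESS = 0x4


def parse_variable_file(raw: bytes) -> Tuple[int, bytes]:
    """Split file content into (attributes, variable data)."""
    if len(raw) < 4:
        raise ValueError("content is shorter than the attribute field")
    (attributes,) = struct.unpack_from("<I", raw, 0)
    return attributes, raw[4:]


def split_file_name(file_name: str) -> Tuple[str, str]:
    """Split "<Name>-<GUID>" into the name and the 36-character GUID."""
    if len(file_name) < 38 or file_name[-37] != "-":
        raise ValueError("not in <Name>-<GUID> form")
    return file_name[:-37], file_name[-36:]


# Hypothetical variable with one byte of data.
attributes, data = parse_variable_file(b"\x07\x00\x00\x00\x01")
print(hex(attributes), data)
print(bool(attributes & EFI_VARIABLE_RUNTIME_ACCESS))
print(split_file_name("Example-12345678-1234-1234-1234-123456789abc"))
```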
Goodbye 386
To simplify maintenance and ongoing development, the kernel developers have removed support for Intel 386 CPUs and for other CPUs based on that architecture, which was extremely widespread more than 20 years ago. 486 and newer x86-32 processors will continue to be supported. It was on a 386 that Torvalds first started developing Linux, but he will not be shedding any tears for the code, commenting: "I'm not sentimental. Good riddance."
Networking
The Berkeley packet filter (BPF), used by sniffer tools such as tcpdump, can now be used to filter VLAN tags. The addition of support for hardware-offloaded encapsulation means that drivers can now offload calculation of checksums for encapsulated packets to network chips, thereby reducing CPU load. The network packet scheduler QFQ (quick fair queuing) has been extended and renamed QFQ+ and should now be faster and fairer. The developer behind the change explores some of the details in a paper entitled "Reducing the Execution Time of Fair-Queueing Schedulers".
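To give an idea of the work that checksum offloading takes away from the CPU, the following sketch computes the 16-bit ones' complement checksum used by IP, TCP and UDP, as described in RFC 1071. It is a plain software version for illustration, with a hypothetical function name; it is not the kernel's implementation and says nothing about how a particular network chip does the job.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit ones' complement checksum from RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Example bytes from RFC 1071; the folded sum is 0xddf2.
print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))  # 0x220d
```

Without offloading, the CPU has to walk over every byte of every packet in this way, once for the inner packet and again for the outer one when packets are encapsulated.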
Scheduler
Changes to the scheduler mean that additional factors will now be taken into account when calculating CPU load, improving the accuracy of such calculations. A more accurate figure is important for improving load balancing. The alteration should also benefit several other changes under which the kernel tries to bundle many small processes onto a single processor core to allow other cores to sleep. Details of the changes and the benefits they bring can be found in an article on LWN.net. The site also has an article on changes to the code for RCU (read-copy-update) blocking aimed at reducing jitter and latency during scheduling.
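One way to make a load figure more accurate is to track how much of each recent time slice a task was runnable and to let older slices count for less. The toy model below sketches such a decayed sum. It assumes, as in the per-entity load tracking design described by LWN.net, that a contribution loses half its weight after 32 periods; the article above does not spell out the mechanism, so the constant, the function name and the sample values should be read as assumptions.

```python
from typing import Iterable

HALF_LIFE_PERIODS = 32                   # assumed half-life in periods
DECAY = 0.5 ** (1 / HALF_LIFE_PERIODS)   # per-period decay factor


def decayed_load(runnable_fractions: Iterable[float]) -> float:
    """Sum per-period runnable fractions, oldest first, with decay.

    Each value is the share of one period (0.0 to 1.0) during which
    the task was runnable. Every new period scales the previous sum
    by DECAY before the latest contribution is added.
    """
    load = 0.0
    for fraction in runnable_fractions:
        load = load * DECAY + fraction
    return load


# A task that ran for one full period and then slept for 32 periods
# retains about half of its original contribution.
print(round(decayed_load([1.0] + [0.0] * 32), 3))  # 0.5
```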
Miscellaneous
- Some key PCI parameters for single root I/O virtualisation (SR-IOV) can now be read and set via sysfs (1 and others, documentation).
- Where the hardware supports it and the kernel has been appropriately configured, CPU 0, on which the kernel was booted, can now be hot-plugged (1, 2 and others).
- When overriding the ACPI tables supplied by the firmware, the kernel can now use tables stored in initramfs in addition to precompiled tables (1 and others, documentation). Many Linux users would have welcomed this function many years ago, as in the past many laptops could only be used sanely with modified ACPI tables. This function was previously omitted due to the concerns of some kernel developers, and there are likely to be far fewer users now, as ACPI support tends to cause fewer problems these days.
- An overview of the main improvements to the tracing code can be found in the main git merge for the perf subsystem.
- Linus Torvalds has not merged a change submitted by Kees Cook which would have made the CONFIG_EXPERIMENTAL configuration option invisible, as it is considered no longer relevant. The change also included dozens of patches to remove dependencies on this option for existing configuration options. More than three dozen such patches have nonetheless been merged into the Linux 3.8 kernel, as the maintainers of some subsystems have picked up on Cook's changes. Due to this, some experimental kernel functions will, from now on, be identifiable only from the help text or from markers such as "(EXPERIMENTAL)" in the description.
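The SR-IOV parameters mentioned in the first item of the list above are exposed as files in the sysfs directory of the PCI device: sriov_totalvfs reports how many virtual functions the device supports and sriov_numvfs reads or sets how many are enabled. The following sketch checks a requested value against the maximum before writing it. The helper name is hypothetical, and a temporary directory stands in for the device directory so that the example runs without hardware or root privileges.

```python
import os
import tempfile


def set_num_vfs(device_dir: str, num_vfs: int) -> None:
    """Set the number of enabled virtual functions for a device.

    device_dir is the device's sysfs directory, for example a
    directory below /sys/bus/pci/devices/.
    """
    with open(os.path.join(device_dir, "sriov_totalvfs")) as f:
        total = int(f.read().strip())
    if not 0 <= num_vfs <= total:
        raise ValueError("device supports between 0 and %d VFs" % total)
    with open(os.path.join(device_dir, "sriov_numvfs"), "w") as f:
        f.write(str(num_vfs))


# Stand-in device directory (an assumption for this demo) that
# pretends to support seven virtual functions.
with tempfile.TemporaryDirectory() as fake_device:
    with open(os.path.join(fake_device, "sriov_totalvfs"), "w") as f:
        f.write("7\n")
    set_num_vfs(fake_device, 4)
    with open(os.path.join(fake_device, "sriov_numvfs")) as f:
        print(f.read())
```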