Kernel Log: 15,000,000 lines, 3.0 promoted to long-term kernel
By Thorsten Leemhuis
With the merging of the first changes for Linux 3.3, the number of lines of kernel source code has passed the 15 million mark. Maintenance of Linux 2.6.32 is set to end in one month's time, while Linux 3.0 and real-time kernels based on it will be maintained for the next two years.
The source code for last week's release of Linux 3.2 fell just short of the 15 million line mark; the kernel finally reached this milestone over the weekend, when the first changes were merged into the main development branch, which is now building towards Linux 3.3. As the kernel hit the 10 million line mark in October 2008, this implies that the Linux kernel source code has grown by more than 50 per cent over the last three years. It's worth noting that these figures include the comments, blank lines, documentation, scripts and userland tools shipped with the kernel (counted with: find . -type f -not -regex '\./\.git.*' | xargs cat | wc -l).
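For readers who would rather not rely on a shell pipeline, the same kind of count can be approximated in Python. The following is only a sketch: the function name count_lines is made up for this example, and the code assumes it is given the top of a kernel checkout. Like the find command, it skips top-level entries whose names start with ".git", ignores symbolic links, and counts newline characters the way wc -l does.

```python
import os


def count_lines(root="."):
    """Count newline characters in all regular files under root.

    Hypothetical helper mirroring the find | xargs cat | wc -l pipeline:
    top-level entries whose names begin with ".git" are left out.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            # Prune .git (and files such as .gitignore) at the top level only,
            # which is what the regex '\./\.git.*' excludes.
            dirnames[:] = [d for d in dirnames if not d.startswith(".git")]
            filenames = [f for f in filenames if not f.startswith(".git")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # find -type f does not match symbolic links, so skip them here too.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so that large files do not exhaust memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    total += chunk.count(b"\n")
    return total
```

Calling count_lines("/path/to/linux") on a checkout should give a figure comparable to the shell command; small differences are possible, for example with file names that xargs splits on whitespace.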
Criticism of this growth is only rarely aired in kernel developer circles. Linux veteran Theodore 'tytso' Ts'o recently noted on the kernel developer mailing list that while analysis of codebase size certainly had "entertainment value", it shouldn't be taken as an indicator of complexity. In contrast, the latter topic has been much discussed among kernel developers in recent times. In the course of reviewing some recent changes, Andrew Morton, for example, noted that kernel development had gone "beyond the point where any additional kernel complexity should be considered a regression". Avoiding regressions – ensuring that new kernels do not generate problems absent in previous versions – is considered to be an almost unbreakable commandment.
Torvalds himself criticised the growing complexity of the kernel in a recent interview with Zeit Online, noting that he awaits with concern the day when kernel developers are faced with a bug which no-one is able to get the measure of. According to a report on LWN.net, complexity was also a topic at this year's kernel developers' conference, where the memory management code came in for particular criticism. According to the report, a "[…] problem involving page migration took three core developers to solve. Nobody really knows how the whole thing is implemented." Another example can be found in an LWN.net article published last year which elucidates the background to a bug in the memory management code which arose during 2.6.34 development, the cause of which took Torvalds and a number of other developers several days to track down.
Analysis
At the time of writing this Kernel Log, a git checkout of the main Linux development tree contained 15,046,951 lines of code. It is possible, though unlikely, that the line count could drop back below the 15 million mark, but in the long term further growth appears inevitable since, with the exception of some outliers, recent versions have typically been around 100,000 to 300,000 lines larger than their direct predecessor.
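The growth figure quoted earlier can be checked with a few lines of arithmetic. This sketch treats the October 2008 baseline as exactly 10,000,000 lines, which is an approximation, since the article only says the kernel "hit" that mark; the function and constant names are invented for the example.

```python
def growth_percent(old, new):
    """Return the relative growth from old to new, in per cent."""
    return (new - old) / old * 100.0


# Assumption: the October 2008 line count is rounded to the 10 million mark.
BASELINE_OCT_2008 = 10_000_000
# Line count of the git checkout at the time of writing, as given above.
CHECKOUT_NOW = 15_046_951

print(f"{growth_percent(BASELINE_OCT_2008, CHECKOUT_NOW):.1f} per cent")
```

With these inputs the result is a little over 50 per cent, in line with the figure given in the text.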
A tool by the name of SLOCCount provides a more detailed analysis of the kernel codebase. In Linux 3.2, whose codebase fell just short of the 15 million mark at 14,998,651 lines, nearly 1.9 million lines of code in the arch directory are responsible for supporting different processor architectures. Similarly, the filesystems directory (fs) contains just under 700,000 lines of code. The largest directory is the drivers directory, at 5.6 million lines. The total amount of code dedicated to drivers is in fact even greater than this, as some driver code is located in other directories. Drivers for audio hardware, for example, are located in sound/drivers/.
SLOC Directory SLOC-by-Language (Sorted)
5615064 drivers ansic=5610304,yacc=1688,asm=1475,perl=792,lex=779,sh=26
1876166 arch ansic=1632759,asm=241881,sh=692,awk=470,pascal=231,perl=58,python=45,sed=30
698974 fs ansic=698974
533134 sound ansic=532951,asm=183
493711 net ansic=493615,awk=96
301646 include ansic=299895,cpp=1709,asm=42
120454 kernel ansic=120149,perl=305
56177 tools ansic=51029,perl=3272,python=1399,sh=476,asm=1
54529 mm ansic=54529
44171 security ansic=44171
42627 crypto ansic=42627
37307 scripts ansic=22487,perl=8287,sh=2028,cpp=1820,yacc=1291,lex=947,python=447
28486 lib ansic=28473,awk=13
14382 block ansic=14382
11579 Documentation ansic=6896,perl=2369,sh=1018,python=949,lisp=218,awk=129
5705 ipc ansic=5705
4661 virt ansic=4661
2377 init ansic=2377
1876 firmware asm=1660,ansic=216
1232 samples ansic=1232
564 usr ansic=550,asm=14
0 top_dir (none)
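The rows of this listing have a regular shape (SLOC, directory, then comma-separated language=count pairs), so they can be parsed and cross-checked mechanically. The sketch below uses three rows copied from the listing; parse_row and language_totals are hypothetical helper names, not part of SLOCCount.

```python
# Three rows copied from the listing above; the remaining rows parse the same way.
SAMPLE_ROWS = [
    "5615064 drivers ansic=5610304,yacc=1688,asm=1475,perl=792,lex=779,sh=26",
    "1876166 arch ansic=1632759,asm=241881,sh=692,awk=470,pascal=231,perl=58,python=45,sed=30",
    "698974 fs ansic=698974",
]


def parse_row(line):
    """Split one row into (sloc, directory, {language: sloc})."""
    sloc, directory, breakdown = line.split(None, 2)
    langs = {}
    if breakdown.strip() != "(none)":
        # Tolerate stray spaces after commas in the breakdown column.
        for item in breakdown.replace(" ", "").split(","):
            name, count = item.split("=")
            langs[name] = int(count)
    return int(sloc), directory, langs


def language_totals(rows):
    """Sum the per-language counts over all given rows."""
    totals = {}
    for row in rows:
        _, _, langs = parse_row(row)
        for name, count in langs.items():
            totals[name] = totals.get(name, 0) + count
    return totals


for row in SAMPLE_ROWS:
    sloc, directory, langs = parse_row(row)
    # Each row's language breakdown should add up to the row's SLOC figure.
    print(directory, sloc, sum(langs.values()) == sloc)
```

Fed the complete listing, language_totals should reproduce the per-language totals that SLOCCount itself reports.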
SLOCCount also takes a close look at the files and generates an analysis of the programming languages used. It shows that Linux 3.2 contains just under 10 million lines of actual code, about 97 per cent of which is ANSI C and about 2.5 per cent of which is assembler.
Totals grouped by language (dominant language first):
ansic: 9667982 (97.22%)
asm: 245256 (2.47%)
perl: 15083 (0.15%)
sh: 4240 (0.04%)
cpp: 3529 (0.04%)
yacc: 2979 (0.03%)
python: 2840 (0.03%)
lex: 1726 (0.02%)
awk: 708 (0.01%)
pascal: 231 (0.00%)
lisp: 218 (0.00%)
sed: 30 (0.00%)
Total Physical Source Lines of Code (SLOC) = 9,944,822
It has been some years since SLOCCount was last updated, however, and, as the Pascal figure shows, it can misclassify files. Perl and Python code is genuinely present in the kernel sources, though, and some kernel versions require the Perl interpreter in order to compile.