Europe's largest GNU/Linux trade fair and conference: Conference DVD 2005
Current trends in Linux Kernel Power Management
Dominik Brodowski
University of Tübingen
Copyright © 2005 by the Author(s)
Abstract
The biggest advantage of any notebook, its ability to run on
battery power, is severely limited by the high default energy
consumption of modern hardware. Power consumption also means heat generation. With heat becoming one of the most striking obstacles to increasing CPU frequencies, features previously known only from notebooks have been introduced to server and workstation CPUs. For these reasons, it is necessary to make use of advanced power saving techniques. The Linux kernel lacked support
for and awareness of this issue for a long time, but during the past two
years much effort has been spent on reducing energy usage. This talk
will discuss several projects and patches related to runtime power management,
show their effect on power consumption, and report on their current
status and on whether inclusion into the Linux kernel sources
maintained by Linus Torvalds seems likely.
Reducing power usage while the system runs as normally as possible is one emerging area of power management in the Linux kernel. Several
different projects aim at the reduction of CPU power consumption. The
frequency at which the CPU executes code can be varied, even dynamically
(cpufreq); the CPU can be put into a low power mode if there is
nothing to do (ACPI C-States), and the CPU doesn't need to be woken up
each millisecond from this low power mode if there is no work to do
(tickless system). While cpufreq and ACPI C-States are ready for
prime-time usage, they continue to be an area of development. In contrast,
tickless systems are mostly a proof of concept so far, but improved patches seem likely in the near future.
In addition, other parts of the system can also be powered down if
there is nothing for them to do. For example, the so-called "laptop
mode" allows hard disks to spin down for longer periods of
time. USB, PCI and PCMCIA devices and LCD backlights can be put into a low power
state manually if they are not needed.
If the user is willing to spend some time to properly set up the system, it is
already possible to use notebooks for a long time on battery power or to reduce the energy consumption of servers and workstations.
This talk will give some hints on this setup process and show
trends which will likely lead to even lower power consumption in
the future.
Demand for Runtime Power Management
With the number of notebooks being sold increasing disproportionately[1], and with hand-held devices and wearable computing[2] allowing “ubiquitous computing”[3], the Linux operating system needs to become ready for what Nils Faerber called its “third age”[4] at last year's LinuxTag: excellent support for stand-alone, “wireless” computing devices. Besides the continuing trouble of getting proper device driver support at all, this calls for managing the power consumption of such devices so that they run for a long time on the existing, limited energy sources.
Another aspect of power management is that it reduces the amount of heat being generated, which prolongs the life cycle of devices and allows for passive (fan-less) cooling and thus less noise. Therefore, power management considerably improves the technical usability of devices.
Runtime Power Management in contrast to Suspending
Two different concepts are distinguished in power management: suspending and runtime power management. The former covers techniques like standby (ACPI[5] S1), suspend to memory (ACPI S3) or suspend to disk (ACPI S4), where the computer is put into a non-working “sleeping” state. In contrast, runtime power management attempts to provide uninterrupted usability of the system, and so tries to stay hidden from the user's point of view.
As this paper focuses on runtime power management, only a short description of the suspending techniques follows. The standby mode stops the CPU, but as all relevant data continues to be stored in memory, normal operation can be resumed within a short time. When suspending to memory, the CPU and several other devices are powered down and need to be re-initialized with information stored in system memory. This means that waking up takes a bit longer. Suspend to disk saves the userspace state into an image on a hard disk and then powers off the computer completely.[6] When the computer is powered on the next time, the image is restored, and terminals, editors and Internet browsers show the same content and behave exactly the same way as before the suspend command was issued.
Even though the ACPI standard[7] attempts to provide an operating-system independent way to put a computer into such a sleeping state and to awake it again, the actual implementations need to be aware of many hardware-specific issues. With several manufacturers withholding the information necessary to write appropriate drivers, support for suspending devices is still highly experimental in Linux.[8]
Linux and Runtime Power Management
Linux traditionally had a “weak spot” with regard to runtime power management. Partly no appropriate documentation for the highly integrated, specific hardware of notebooks has been made available to developers,[9] partly there exist constraints in the operating system,[10] and partly developers lacked interest – and time – to write appropriate code. Lately, however, much work has gone into reducing the power consumption of mobile devices running Linux. Therefore, this paper will explain some of the features available in current distributions and kernels, and some which are likely to be merged in the near future.
As they promise the largest reduction of energy consumption, it is important to determine those devices which typically consume most power. With processors specifically designed for mobile devices being available, the picture isn't as clear as it was before, though: while on desktops and servers CPUs require most power (up to 60%),[11] the display back light has become the leading consumer on modern notebooks.[12] Other major consumers are hard disks, system memory and wireless LAN devices.
As CPUs offer many different and effective methods to reduce their energy consumption, this paper will cover them first.
As the Central Processing Unit (CPU) of a computer consumes large amounts of energy, several different techniques attempt to reduce this. If there is no work to do, the processor is put into a low-power idle state (1), the frequency the CPU operates at can be modulated (2), and the CPU can be forced into a non-working state for short periods of time (throttling, 3).
Possibly the most important runtime power management technique is the “idling” of CPUs. Under normal operation the CPU only needs to execute code once in a while; most of the time it either waits for data to arrive from hard disks or the user, or there simply isn't anything it could do. For example, on the author's system the CPU is only needed approx. 4% of the time while writing this text.
When there is no work to do, certain parts of the CPU – the actual code execution units, for example – can be shut down and re-activated once they are required for operation again.
Modern processors offer several such “idle states”[13]. In some of them the CPU can be reactivated very quickly, but the power savings are limited (on the x86 architecture, this is the “hlt” instruction). In others more power is saved, but the wakeup latency – the time from an interrupt event to the CPU being ready to execute code again – increases.
Table 1 shows the power consumption of some common CPUs of the x86 architecture depending on the idle state. With reductions of 60 to 98% compared to “continuous operation” and common CPU usage rates of 5 to 50%, this means a reduction in CPU energy consumption somewhere between 30% and 90%.
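The range quoted above can be reproduced with a short calculation. The following is a minimal sketch, assuming that the CPU draws full power while busy and the reduced idle-state power for the rest of the time, and that transition costs are ignored; the function name is chosen for illustration only.

```python
def idle_saving_fraction(cpu_usage, idle_power_reduction):
    """Fraction of CPU energy saved by idling whenever there is no work.

    cpu_usage: share of time the CPU executes code (0..1).
    idle_power_reduction: relative power reduction of the idle state
        compared to continuous operation (0..1).
    """
    if not (0.0 <= cpu_usage <= 1.0 and 0.0 <= idle_power_reduction <= 1.0):
        raise ValueError("both arguments must lie between 0 and 1")
    # Power is only reduced during the idle share of the time.
    return (1.0 - cpu_usage) * idle_power_reduction


if __name__ == "__main__":
    # The two extremes mentioned in the text.
    for usage, reduction in [(0.50, 0.60), (0.05, 0.98)]:
        saving = idle_saving_fraction(usage, reduction)
        print(f"usage {usage:.0%}, idle reduction {reduction:.0%}: "
              f"saving {saving:.0%}")
```

For a 50% busy CPU with a 60% idle reduction this gives 30%; for a 5% busy CPU with a 98% idle reduction it gives about 93%, in line with the range stated above.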
Improvement #1: ACPI _CST
With more idle states being implemented in hardware, allowing for a more fine-grained compromise between energy consumption and responsiveness, the operating system needs to be “smart” enough to activate them. On x86 hardware, only the ACPI subsystem can utilize multiple idle states, and even it was designed for only three different idle states at first. The ACPI specification revision 2.0 added a new method called “_CST” for communication between the BIOS and the operating system to allow for even more idle states. Starting with kernel 2.6.11, support for this method was added to the Linux kernel. You can check which ACPI-based idle states are available on your system by running cat on /proc/acpi/processor/*/power.
Improvement #2: Multiprocessor Support
As one physical CPU can only be put into an idle state as a whole, processors implementing simultaneous multi-threading (SMT, for example HyperThreading) or multiple sub-CPUs on one CPU die (multi-core CPUs) need special care when they are to be put into an idle state. Also, on platforms using multiple physical CPU packages, special care needs to be taken that the CPU caches maintain valid content.[14] Due to these obstacles, idle states higher than C1 are not present in many BIOSes and platforms, and the Linux kernel even lacked support for them as of release 2.6.12. A small and simple patch already included in the ACPI development tree adds support for real, effective idle states on multiprocessor systems. As the patch is minimally invasive, it is very likely to be merged into kernel 2.6.13 and might already be available in kernels used at the time of LinuxTag 2005.
Improvement #3: Tickless Systems
While the much discussed change from 100 Hz to 1000 Hz in kernel 2.6 – meaning the kernel is activated 1000 times a second (“ticks”) to determine which process is to run next and to fulfil some “housekeeping” tasks – provides better interactivity, it also means the time between entering and leaving idle states decreased dramatically. Taking into consideration that entering and leaving a sleep state also consumes energy[15], it becomes clear that the power usage increased because of this change. Therefore, certain distribution kernels are specially modified for notebooks and continue to run with 100 Hz on mobile devices.
The technically superior approach is to not wake up every millisecond, but only when there actually is a new task to run (“dynamic ticks” or “tickless systems”) or when interrupt activity (e.g. the user hitting a key on the keyboard) demands a reaction by the kernel and userspace programs. The Linux kernel relies heavily on the concept of “jiffies” (a counter incremented each “tick”) for timers and fair scheduling, and the interrupt-generating hardware needs special care to be deactivated or to run at varying rates, so making the kernel “tickless” is a demanding and tough task. With the timing core of the kernel likely to undergo a major overhaul[16] that makes it aware of tickless systems, one obstacle seemed likely to disappear soon. Surprisingly, a group of developers did not even wait for this feature and proved with a medium-sized patch[17] that tickless systems are not as far away as feared. Nonetheless, this approach still needs to be considered highly experimental and the modified code should not be used on so-called production systems yet.
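The difference between a periodic tick and a tickless system can be illustrated by counting wakeups. The following is a simplified model, not the kernel's implementation: it assumes that a tickless kernel sleeps exactly until the next pending timer expires, and the function names and timer values are made up for illustration.

```python
def periodic_wakeups(duration_s, hz):
    """Wakeups with a fixed tick: one per tick, whether or not work is pending."""
    return int(duration_s * hz)


def tickless_wakeups(duration_s, timer_expiries_s):
    """Wakeups when the CPU sleeps until the next pending timer expires.

    Timers expiring at the same instant share one wakeup; timers outside
    the observed interval are not counted.
    """
    return len({t for t in timer_expiries_s if 0.0 <= t < duration_s})


if __name__ == "__main__":
    timers = [0.25, 0.5, 0.5, 2.0]  # hypothetical expiry times in seconds
    print("1000 Hz tick:", periodic_wakeups(1.0, 1000))
    print(" 100 Hz tick:", periodic_wakeups(1.0, 100))
    print("tickless    :", tickless_wakeups(1.0, timers))
```

On an otherwise idle system the periodic tick wakes the CPU 100 or 1000 times per second, while the tickless model wakes it only for the two timers that actually expire within the second.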
Perhaps the best-known runtime power management technique is CPU frequency and voltage scaling – not from its technical name, but from marketing names like Intel(R) SpeedStep Technology, AMD PowerNow! and Cool&Quiet! or Transmeta Longrun, to name a few.
What makes this such an interesting feature? If the CPU clock frequency is lowered, the power consumption is reduced linearly. However, the voltage driving the CPU can be lowered as well, resulting in a highly increased “instruction per energy consumption” ratio[18]. Put plainly: if you accept waiting longer for the result, you can compute much more data with the same amount of energy. Usually there is not much work for the CPU to do – even watching a DVD is not very CPU intensive, and most often there are other “bottlenecks” like internet connections – and frequency and voltage scaling also has positive side-effects on CPU Power States[19]. This is therefore most definitely something a user should make use of when running a notebook on battery power.
Support for CPU frequency (and voltage) scaling is highly hardware-dependent. Therefore, only a minority of all platforms have CPU frequency scaling implemented; however, with the emergence of “AMD Cool&Quiet” and “Intel Enhanced SpeedStep” this technology has propagated to the server market.
In the Linux kernel, support for CPU frequency and voltage scaling is provided by the cpufreq subsystem in all 2.6 kernels[20], and more and more hardware drivers are being added. However, the userspace side of cpufreq in particular is still under heavy development, with different daemons and tools springing up at a high rate. A common midlayer, cpufrequtils, is emerging, and using its cpufreq-info tool you can determine whether cpufreq is activated on a system.
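The effect on the energy needed per task can be estimated from the C0 figures in table 1. The following is a minimal sketch, assuming a purely CPU-bound task whose cycle count does not depend on the clock frequency and a constant power draw while the task runs; the function name and the task size are chosen for illustration.

```python
def energy_per_task_ws(power_w, freq_mhz, work_megacycles):
    """Energy in watt-seconds to finish a fixed amount of CPU work.

    Run time is work / frequency (megacycles / MHz = seconds);
    energy is power times run time.
    """
    run_time_s = work_megacycles / freq_mhz
    return power_w * run_time_s


if __name__ == "__main__":
    work = 1400  # hypothetical task: one second at 1400 MHz
    # C0 figures for the Intel Pentium M from table 1.
    full = energy_per_task_ws(22, 1400, work)
    scaled = energy_per_task_ws(6, 600, work)
    print(f"1400 MHz: {full:.1f} Ws, 600 MHz: {scaled:.1f} Ws")
```

Under these assumptions the task needs 22 Ws at 1400 MHz but only 14 Ws at 600 MHz, although it takes more than twice as long. The saving comes from the lowered voltage; frequency reduction alone would leave the energy per task unchanged.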
Improvement #1: Dynamic Frequency Scaling
Switching the frequency only when the power source changes, that is, when an AC adapter is attached or removed, already provides important reductions in energy consumption and was therefore used in first-generation CPU frequency scaling implementations. On-demand switching between multiple frequency levels goes further: it reduces heat continuously and allows the CPU to be utilized fully even on battery power, with energy consumption suffering only a minor increase.
In-kernel or userspace scheduling?
While the academic community, which led the development of Linux cpufreq at first, suggested letting userspace tools determine the appropriate CPU frequency,[21] Linus Torvalds objected to this approach and explained that only the kernel has sufficient knowledge to select the most appropriate CPU frequency.[22]
Therefore, the current Linux cpufreq infrastructure is built around the concept of in-kernel governors. A governor is an algorithm which calculates the CPU frequency it considers appropriate for any given moment. One such governor – cynics may call it a pseudo-governor – allows for userspace control over the CPU frequency. Therefore, both approaches – userspace and kernelspace deciding over the processor speed – are possible and widely used.
So, which method should a user select? Userspace governors provide for more fine-grained tuning, many more and different algorithms are available in a multitude of userspace cpufreq daemons,[23] and academic studies have proved the technical superiority of this approach.
However, we do live in an imperfect world: these academic studies depended on highly specialised software which added callbacks to the daemon governing the processor speed; they relied on trusting the values given by these callbacks, and the devices studied had a highly predictable CPU load. While this userspace approach indeed seems to be best on specialised embedded systems, general-purpose computers need to run many different, non-specialised and sometimes old programs which cannot easily have such callbacks added. There, only the kernel “knows” how much processing time it wants to give to programs and which programs need to run (process scheduler)[24], how many IO requests are pending (io and net schedulers) and whether the system is getting too hot (ACPI, lm_sensors).[25] Combining all this data to determine the most appropriate CPU frequency is the difficult task in-kernel dynamic cpufreq governors need to fulfil.
Different in-kernel governors
As of 2.6.12, there are four cpufreq governors present in the Linux kernel. You can select which one to use with the command “cpufreq-set -g GOVERNOR” provided by the cpufrequtils package. The userspace governor was already described above; it allows for userspace control over the CPU frequency: “cpufreq-set -f FREQUENCY”. The performance and powersave governors statically select the highest and lowest CPU frequency currently available. Only the ondemand governor, available since 2.6.8, provides for in-kernel dynamic frequency scaling.[26] The ondemand governor decreases the frequency step by step if the processor is idle, but increases it to full processing power if there seems to be demand for it. It needs to be noted that some processors, for example those from AMD, only allow step-by-step switching between frequencies; using this governor on them may thus cause unexpected latencies.
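The decision rule of the ondemand governor as described above can be sketched in a few lines. This is an illustration only and not the kernel's code: the load thresholds and the frequency table are hypothetical values, and the function name is made up.

```python
def ondemand_next_freq(freqs_mhz, current_mhz, load,
                       up_threshold=0.80, down_threshold=0.20):
    """Pick the next frequency in the style of the ondemand governor.

    freqs_mhz: available frequencies, sorted in ascending order.
    load: CPU load observed in the last sampling interval (0..1).
    The thresholds are hypothetical and chosen for illustration.
    """
    index = freqs_mhz.index(current_mhz)
    if load > up_threshold:
        # Demand detected: jump straight to full processing power.
        return freqs_mhz[-1]
    if load < down_threshold and index > 0:
        # Mostly idle: go down by a single step only.
        return freqs_mhz[index - 1]
    return current_mhz


if __name__ == "__main__":
    table = [600, 800, 1000, 1400]  # hypothetical frequency table in MHz
    print(ondemand_next_freq(table, 600, 0.95))
    print(ondemand_next_freq(table, 1400, 0.05))
```

The asymmetry is intentional in this sketch: a loaded CPU jumps from 600 MHz directly to 1400 MHz, while an idle CPU at 1400 MHz steps down to 1000 MHz first.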
Two additional in-kernel cpufreq governors named conservative and past are being discussed at the moment.[27] The conservative governor[28] tries to overcome the limitation of the ondemand governor on AMD CPUs noted above and only increases the CPU frequency step by step – and only if the demand for processing power is present for a longer period of time. In doing so, it tries to spend more time in lower frequency states and to provide more energy savings at the cost of slightly reduced performance and responsiveness.
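In the same style, the behaviour described for the conservative governor might be sketched as follows. The thresholds, the number of consecutive samples that count as “a longer period of time”, and the frequency table are all assumptions made for illustration; the proposed patch may work differently.

```python
def conservative_next_freq(freqs_mhz, current_mhz, recent_loads,
                           up_threshold=0.80, down_threshold=0.20,
                           sustain=3):
    """Pick the next frequency in the style of the conservative governor.

    freqs_mhz: available frequencies, sorted in ascending order.
    recent_loads: CPU load samples (0..1), oldest first.
    sustain: hypothetical number of consecutive high-load samples
        required before the frequency is raised by one step.
    """
    index = freqs_mhz.index(current_mhz)
    window = recent_loads[-sustain:]
    sustained_demand = (len(window) == sustain
                        and all(load > up_threshold for load in window))
    if sustained_demand and index < len(freqs_mhz) - 1:
        # Demand has persisted: go up by one step, never jump to the top.
        return freqs_mhz[index + 1]
    if recent_loads and recent_loads[-1] < down_threshold and index > 0:
        return freqs_mhz[index - 1]
    return current_mhz


if __name__ == "__main__":
    table = [600, 800, 1000, 1400]  # hypothetical frequency table in MHz
    print(conservative_next_freq(table, 600, [0.9]))
    print(conservative_next_freq(table, 600, [0.9, 0.9, 0.9]))
```

A single busy sample leaves the frequency at 600 MHz; only three busy samples in a row raise it, and then only to 800 MHz.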
The past governor[29] implements a completely different calculation algorithm for the targeted CPU frequency.[30] It always targets a CPU load of 70%. If the CPU load was lower in the “window” looked at in the past, the CPU speed is decreased; if it was higher, it is increased – always aiming at 70% CPU load under the assumption of constant CPU load. As it tries to modify the CPU frequency only slightly – in contrast to the “all or nothing” decision the ondemand governor makes when the CPU load is too high – it attempts to stay longer at lower frequency levels and to provide increased energy savings. However, the large “idleness window” of 30% may prove to be counter-productive.
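One way to read “aiming at 70% CPU load under the assumption of constant CPU load” is a proportional rule: if the same amount of work arrives in the next window, which frequency would make the CPU 70% busy? The sketch below implements this reading. The proportional formula, the rounding up to the next available frequency and the frequency table are assumptions for illustration, not taken from the proposed patch.

```python
def past_next_freq(freqs_mhz, current_mhz, observed_load, target_load=0.70):
    """Pick the frequency that would yield the target load for the same work.

    observed_load: CPU load (0..1) measured in the past window while
        running at current_mhz.
    The work done in the window is proportional to current_mhz * observed_load;
    dividing by the target load gives the frequency at which that work
    would keep the CPU busy for target_load of the time.
    """
    wanted_mhz = current_mhz * observed_load / target_load
    # Choose the lowest available frequency that is at least as fast.
    for freq in sorted(freqs_mhz):
        if freq >= wanted_mhz:
            return freq
    return max(freqs_mhz)


if __name__ == "__main__":
    table = [600, 800, 1000, 1400]  # hypothetical frequency table in MHz
    print(past_next_freq(table, 1000, 0.35))
    print(past_next_freq(table, 1000, 0.90))
```

At 1000 MHz, a load of 35% corresponds to 500 MHz at the target load, so the lowest level of 600 MHz suffices; a load of 90% corresponds to about 1286 MHz, so 1400 MHz is selected.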
Improvement #2: Multiprocessor Support
One physical CPU package must run at one frequency, meaning there needs to be some sort of coordination between multiple logical or physical CPUs. This was achieved by changing cpufreq to be “processor package”-centric instead of “logical CPU”-centric. Another problematic issue is that the primary timing source on the x86 architecture, the Time Stamp Counter (TSC), is located on the CPU die and therefore is affected by a changing CPU frequency. Especially on SMP systems you should use different timing sources (ACPI Power Management Timer, HPET) to achieve sufficiently precise results, even if they are slower to access.
The last and least CPU-related power management technique to be explained here is throttling. It stops the execution of instructions in the CPU for certain short periods of time. As such, it means the CPU frequency is effectively lowered; however, this is not done in a homogeneous manner. Therefore the CPU voltage cannot be lowered in the meantime.
In contrast to CPU frequency scaling, where the operating frequency is constantly modulated, throttling means the CPU is forced to a halt for short periods of time. If throttled, the CPU is put into a physical and electrical state comparable to the idling states mentioned above, so throttling can be described as an “enforced idling” of the CPU.
As throttling and certain CPU power states typically utilize similar hardware implementations,[31] and as throttling does not have a positive effect on the energy consumption during CPU power states, throttling the CPU by a given rate is only useful if the CPU is less idle than the throttling rate.
Therefore, throttling only makes sense if the CPU temperature has become too high because the CPU was active excessively. As throttling “forces” some “idling”, which then lowers the energy consumption and heat generation, it is a good tool for “passive cooling”. Passive cooling – in contrast to active cooling, which utilizes fans – is done by lowering the CPU load. It works best with CPU frequency scaling, but is also possible using throttling.
Throttling is implemented in a few cpufreq processor drivers, most notably p4-clockmod,[32] and on ACPI-based platforms it is user-controllable using the file “/proc/acpi/processor/*/throttling”.
While the CPU is in the throttling state, on typical Intel or AMD x86 and x86_64 CPUs it consumes as much power as in the Stop Grant (Cache Snoopable) state. Depending on how high the throttling rate is, and how idle the system is, the CPU energy consumption will vary.
Assuming an idleness rate of zero, the power usage rates of some CPUs of the x86 architecture can be seen in table 2. Assuming one specific computing task needs 1s to finish at 100% CPU power available, the CPU energy usage for this task actually increases if throttling is used (table 3).
This shows that throttling itself does not save battery power at all – it even increases the battery usage of any specific computing task. However, if a fan doesn't need to be started because of passive cooling, or if certain tasks do not need to be run (e.g. the screen is normally refreshed every 10 ms, but the CPU is throttled so much it misses the deadline every second time, so the screen is only refreshed every 20 ms), energy consumption may be reduced slightly overall.
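The figures in tables 2 and 3 follow from the two formulas given with those tables and can be recomputed with a few lines of code. The following is a minimal sketch using the Stop Grant power values that reproduce table 2 (2.2 W for the Mobile AMD Athlon 64 2800+, 7.3 W for the Intel Pentium M 1400 MHz); the function names are chosen for illustration.

```python
def throttled_power_w(p_normal_w, p_stop_grant_w, rate):
    """Average power at throttling rate r: P(r) = Px * (1 - r) + Ps * r."""
    return p_normal_w * (1.0 - rate) + p_stop_grant_w * rate


def task_energy_ws(p_normal_w, p_stop_grant_w, rate, task_time_s=1.0):
    """Energy for a task that needs task_time_s at full speed.

    At throttling rate r the task takes task_time_s / (1 - r), so
    W(r) = P(r) * t / (1 - r) = Px * t + Ps * t * r / (1 - r),
    which grows with r whenever Ps is greater than zero.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("throttling rate must be in [0, 1)")
    power_w = throttled_power_w(p_normal_w, p_stop_grant_w, rate)
    return power_w * task_time_s / (1.0 - rate)


if __name__ == "__main__":
    cpus = {"Mobile AMD Athlon 64 2800+": (35.0, 2.2),
            "Intel Pentium M 1400 MHz": (22.0, 7.3)}
    for name, (p_normal, p_stop_grant) in cpus.items():
        for rate in (0.0, 0.25, 0.50, 0.75):
            power = throttled_power_w(p_normal, p_stop_grant, rate)
            energy = task_energy_ws(p_normal, p_stop_grant, rate)
            print(f"{name}, {rate:.0%}: {power:.1f} W, {energy:.1f} Ws")
```

The sketch works with unrounded intermediate values, so some results differ from table 3 in the last digit (for example 29.3 Ws instead of 29.4 Ws for the Pentium M at 50% throttling). The rewritten form of W(r) in the comment shows why the energy per task can only increase: the work itself always costs Px * t, and every throttled interval adds Stop Grant power on top.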
The large amounts of energy consumed by CPUs can be reduced substantially by using idling and frequency scaling techniques. As a last resort, throttling can also be used for thermal management. While there is still need for improvement, especially in the support for multi-processor systems, the existing infrastructure already provides first-class power management results.
Even though the CPU may well be the largest single consumer of energy, most power is still consumed by other devices. With CPU power management reaching a point of saturation, the savings potential of these devices is attracting more and more attention from both hardware manufacturers and kernel developers.
It is important not to shut down a device, or to put it to sleep, if it is needed by other devices still in use. The integrated device and driver model developed for kernel 2.6 attempts to represent these dependencies.[33] However, appropriately describing all logical, technical and electrical dependencies continues to be a major hurdle.
With LCD backlights being one of the major power consumers, four different methods currently exist to reduce their power consumption. The first is implemented independently of the operating system by the firmware, which is triggered by special keys or key combinations, or by the lid being closed.
On some platforms, the ACPI video module allows the brightness to be modified using “/proc/acpi/video/*/brightness”. However, this proved to be non-functional on the author's notebook; additionally it is only available on some newer notebooks conforming to the ACPI 2.0 or 3.0 standard[34].
Additionally, the display power management configuration of the X Window System may influence the backlight: even though these settings[35] technically are meant to affect only the display and not the backlight, on some platforms they do affect both.
Finally, a new backlight infrastructure was merged into kernel 2.6.11. It attempts to provide a unified interface in “/sys/class/backlight/” to handle backlights on all architectures and platforms. However, only one device driver, for Sharp Corgi PDAs, is present in the kernel as of 2.6.12-rc3.
Hard disks can also be put into a power-saving mode, where, for example, the spindle motor is turned off. The time the drive waits after the last read or write command before it enters this mode can be modified using the “hdparm -S” command.[36] However, repeatedly starting and stopping hard disks might wear them down at an increased rate,[37] so do not set the delay too short.
However, during normal operation the hard disk is accessed quite often: whenever a file[38] or information related to a file[39] is changed, the kernel waits at most five seconds until the information is written out to disk. Using the “laptop mode” tuning available in 2.6 kernels, this delay is increased. Also, the kernel tries to batch disk accesses, so that whenever the hard disk is woken up, all pending requests are handled and the hard disk can, hopefully, sleep for a longer time again.
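The effect of a longer write-back delay combined with batching can be illustrated by counting how often the disk has to be woken up. This is a deliberately simplified model, not the kernel's write-back logic: it assumes that a flush happens when the oldest unwritten change reaches the maximum delay and that this flush also writes every later change made up to that moment. The write times and the 600-second delay are hypothetical.

```python
def count_disk_flushes(write_times_s, max_delay_s):
    """Count flushes if dirty data is written at most max_delay_s after a change.

    Each flush is triggered by the oldest unwritten change and takes
    along every later change made before the flush happens.
    """
    flushes = 0
    flush_at = None
    for t in sorted(write_times_s):
        if flush_at is None or t > flush_at:
            # This change is not covered by the pending flush: schedule a new one.
            flushes += 1
            flush_at = t + max_delay_s
    return flushes


if __name__ == "__main__":
    writes = [0, 3, 20, 22, 100, 400]  # hypothetical change times in seconds
    print("5 s delay  :", count_disk_flushes(writes, 5))
    print("600 s delay:", count_disk_flushes(writes, 600))
```

With the default delay of five seconds the six changes cause four separate disk accesses; with a delay of ten minutes they are written out in a single batch, so the disk can stay spun down in between.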
The scripts to run “laptop mode” are both included in the kernel sources[40] and in distribution packages. They are highly configurable, and the possible settings are described in an excellent manner both in the kernel[41] and in the configuration file itself[42].
Also, adjusting userspace may reduce disk accesses – for example, the CUPS daemon and any logging utilities commonly write data out at a continuous rate. Therefore, you can consider whether the increase in battery runtime and, as an added bonus, the noise reduction outweigh the loss of this functionality – or, at least, its reduction by storing the log file on a tmpfs partition. The CPU usage of these tools can usually be ignored, though.
Bus Devices (PCI, USB, PCMCIA et al.)
The Linux device model[43] offers a unified interface to put devices into a frozen state or other runtime[44] power management states, as this feature is needed for proper suspend to disk support. While an interface to userspace exists for each device in the file power/state inside the sysfs representation of a physical device, this interface is almost completely unusable at the moment.[45] As it is intended to be fixed soon, only a short description as of kernel 2.6.12-rc4 follows: echoing “3” into power/state asks for the device to be frozen[46], writing “0” to this file puts the device back into full power.
However, PCI devices which do not have a driver attached are not put into a low power or even off state at the moment. Also, devices are not disabled or put into an off state, which might offer even greater savings.
Therefore, it might help to take a closer look at the drivers governing devices. For example, loading and then rmmod'ing ipw2100, the Intel® PRO/Wireless 2100 driver for Linux, puts the device into an off state. Using the vesafb driver for X instead of ati leads to a much higher power consumption.[47] Also, WLAN antenna power consumption might be tunable using “iwconfig device power”.[48]
As can be seen above, achieving a high reduction of power consumption requires the combination of several differing techniques affecting different parts and pieces of hardware and software. However, merely using these techniques side by side is not the way to go – there needs to be coordination between them to achieve an optimum between the conflicting goals of usability and energy consumption.
Currently, this coordination has to be done, at least partly, by the user or system administrator. Several power management tools only handle one or a few techniques, and, for example, none allows for putting specific devices into low power states. In addition, there is no integration with the actual applications, which sometimes know quite well how much processing power in the CPU or in the graphics adapter they need and when they will next need data from the hard disk.
While a userspace-based library governing all these aspects already exists for specialised embedded systems, the amount of work necessary to make existing applications aware of such power management callbacks has hindered the development of such a daemon on multi-purpose computers.
In addition, such a unified userspace tool would ease the path to making the advanced power management techniques described in this paper available not only to those willing to fine-tune their system, re-compile kernels and even risk some loss of data, but also to the ever-growing public using Linux who do not want to dig into kernel internals or externals.
While runtime power management was not adequately taken care of in the Linux kernel for a long time, several exciting features have already been included in kernels of the 2.6 series. Several emerging projects promise even greater savings in energy consumption, most notably tickless systems and bus device power management. Therefore, staying at the bleeding edge of the stable kernel series is likely to keep bringing the user continuing improvements in runtime power management.
Table 1: Idle States and Energy Consumption
| Processor[a] [b] | C0 (Normal operation) | C1[c] [d] | C2 | C3 | C4 |
|---|---|---|---|---|---|
| AMD Geode NX 1750 | 14 – 25 W | unk. | unk. | 3.0 W | n/a |
| Mobile AMD Athlon 64 2800+ | 35 W | 2.2 W | 2.2 W | unk. | n/a |
| Mobile AMD Athlon 64 2800+, frequency scaled to 800 MHz | 12 W | 2.2 W | 2.2 W | 1.2 W | n/a |
| Intel Pentium M 1400 MHz | 22 W | 7.3 W | 7.3 W | 5.1 W | 0.55 W |
| Intel Pentium M 1400 MHz, scaled to 600 MHz | 6 W | 1.8 W | 1.8 W | 1.1 W | 0.55 W |
| Intel Mobile Pentium III 600 MHz | 8.7 – 14.4 W | 1.1 W | 1.1 W | 0.3 W | n/a |
Table 2: Throttling Rates and Power Consumption
| Processor[a] [b] | 0 % throttling | 25 % throttling | 50 % throttling | 75 % throttling |
|---|---|---|---|---|
| Mobile AMD Athlon 64 2800+ | 35 W | 26.8 W | 18.6 W | 10.4 W |
| Intel Pentium M 1400 MHz | 22 W | 18.3 W | 14.7 W | 11.0 W |
Values are calculated by $P(r) = P_x (1 - r) + P_s \cdot r$,
where $P_x$ is the power consumption at normal operation, $P_s$ is the power consumption in Stop Grant state, and $r$ is the throttling rate.
Table 3: Power Consumption for a specific Computing Task related to Throttling Rates
| Processor[a] | 0 % throttling | 25 % throttling | 50 % throttling | 75 % throttling |
|---|---|---|---|---|
| Mobile AMD Athlon 64 2800+ | 35 Ws | 35.7 Ws | 37.2 Ws | 41.6 Ws |
| Intel Pentium M 1400 MHz | 22 Ws | 24.4 Ws | 29.4 Ws | 44.0 Ws |
Values are calculated by $W(r) = P(r) \cdot t / (1 - r)$,
where $P(r)$ is the power consumption at the specified throttling rate $r$, as determined in table 2, and $t = 1\,\mathrm{s}$ is the time the task needs at 100% CPU power.