

### Low Power through Software Optimisation

SOCRATES'04 Joan Oliver **ETSE-UAB** 

# Power analysis of embedded software



#### Until now

- Power analysis techniques have been based on circuit-level or architectural-level models.

#### Instruction-level power analysis model

- For off-the-shelf microprocessors and microcontrollers
- For embedded cores or IPs

#### Problem:

- Power consumption information is not available.

#### So: development of a methodology and its application to

- The 486DX2-S with 4 MB computer board for mobile applications (Tiwari et al.)
- The current Motorola 68HC908GP32 microcontroller

#### Minimisation techniques in DSP processor

#### Low power in Intel® 855GM Chipset


### Experimental method (I)



- The traditional method for estimating microprocessor power consumption is based on analysing the power consumption of the individual microprocessor modules. → It is difficult to establish a general model this way, because power consumption varies from program to program.
- Hypothesis: By measuring the current drawn by the processor as it repeatedly executes certain instructions or certain short instruction sequences, it is possible to obtain most of the information that is needed to evaluate the power cost of a program for that processor.
- Based on:
  - Complexity hidden behind a simple interface: the instruction set.
  - Individual instruction power analysis: specific circuit activity per instruction.
  - Inter-instruction effects must be taken into account.
  - Valid for standard microprocessors and embedded cores.


### Experimental method (II)



- Power versus energy
  - Average power consumed by a microprocessor: $P = I \cdot V_{CC}$.
  - Energy consumed by a program: $E = P \cdot T$.
  - **Execution time:** $T = \tau \cdot N$, where $\tau$ is the clock period and $N$ is the number of clock cycles.
- Current measurement
  - Though results are specific to every processor and board, the methodology of the model is widely applicable.
  - The current is measured with a standard off-the-shelf, dual-slope, integrating ammeter.
  - The execution time of the program is measured by detecting a specific state with a logic analyser.
  - A program written with several instances of the instruction sequence executing in a loop has a periodic current waveform, which yields a steady reading on the ammeter.
  - In the setup: $V_{CC} = 3.3$ V, $\tau = 25$ ns ($f_{internal} = 40$ MHz) →
    - The energy cost of a program that takes $N$ cycles with an average current of $I$ amps is $E = I \cdot V_{CC} \cdot N \cdot \tau = 8.25 \cdot 10^{-8} \cdot I \cdot N$ J.
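
The energy relation for this setup can be checked numerically. The following is a minimal sketch, assuming the supply voltage and clock period quoted on this slide; the function name `program_energy` and the sample current and cycle count are illustrative, not measured values.

```python
V_CC = 3.3     # supply voltage in volts (setup value from the slide)
TAU = 25e-9    # clock period in seconds (40 MHz internal clock)


def program_energy(avg_current_a: float, cycles: int) -> float:
    """Return E = I * V_CC * N * tau in joules."""
    return avg_current_a * V_CC * cycles * TAU


# Hypothetical program: 0.3 A average current over 1000 cycles.
energy = program_energy(0.3, 1000)
print(f"{energy:.3e} J")
```

With `avg_current_a = 1.0` and `cycles = 1` the function returns the constant factor 8.25·10<sup>-8</sup> of the formula.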


### Instruction level modeling (I)



#### Development of an instruction-level model

- Each instruction is assigned a base energy cost.
- The variation in base energy cost due to different operand and address values can be quantified.
- The energy cost of a program is based on the sum of the base energy costs of its instructions.
- To be considered: circuit state effects and resource constraints (which can lead to stalls and cache misses).

#### Base energy cost per instruction

- It is determined by constructing a loop with several instances of the same instruction and measuring the average current drawn.
- The base energy cost is proportional to this average current multiplied by the number of cycles taken by each instance.
- Although the 486DX2 executes more than one instruction at a given time (pipelining), the concept of base energy cost per instruction remains unchanged.
- When an instruction takes multiple cycles in a given pipeline stage, the base energy cost is likewise the measured average current multiplied by the number of cycles the instruction takes in that stage.


### Instruction level modeling (II)



- CPU base costs for some 486DX2 instructions.
  - Overall base energy cost per instruction: $E_B = \text{Base Cost} \cdot \text{Cycles} \cdot V_{CC} \cdot \tau$ (columns 3 and 4 of the table).
  - Variations in repeated run experiments: ±1 mA over average currents.

| Number | Instruction     | Base Cost (mA) | Cycles |
|--------|-----------------|----------------|--------|
| 1      | NOP             | 275.7          | 1      |
| 2      | MOV DX,BX       | 302.4          | 1      |
| 3      | MOV DX,[BX]     | 428.3          | 1      |
| 4      | MOV DX,[BX][DI] | 409.0          | 2      |
| 5      | MOV [BX],DX     | 521.7          | 1      |
| 6      | MOV [BX][DI],DX | 451.7          | 2      |
| 7      | ADD DX,BX       | 313.6          | 1      |
| 8      | ADD DX,[BX]     | 400.1          | 2      |
| 9      | ADD [BX],DX     | 415.7          | 3      |
| 10     | SAL BX,1        | 300.8          | 3      |
| 11     | SAL BX,CL       | 306.5          | 3      |
| 12     | LEA DX,[BX]     | 364.4          | 1      |
| 13     | LEA DX,[BX][DI] | 345.2          | 2      |
| 14     | JMP label       | 373.0          | 3      |
| 15     | JZ label        | 375.7          | 3      |
| 16     | JZ label        | 355.9          | 1      |
| 17     | CMP BX,DX       | 298.2          | 1      |
| 18     | CMP [BX],DX     | 388.0          | 2      |
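
As an illustration of how the table converts to energy, the sketch below applies the base energy formula to a few rows. It assumes the setup values given earlier (3.3 V, 25 ns); the dictionary `BASE_COSTS` and the function `base_energy` are names introduced here for the example only.

```python
V_CC = 3.3     # volts
TAU = 25e-9    # seconds per cycle

# (average current in mA, cycles), copied from a few rows of the table
BASE_COSTS = {
    "NOP": (275.7, 1),
    "MOV DX,BX": (302.4, 1),
    "MOV DX,[BX]": (428.3, 1),
    "ADD DX,[BX]": (400.1, 2),
}


def base_energy(instruction: str) -> float:
    """Base energy cost E_B in joules for one instance of the instruction."""
    current_ma, cycles = BASE_COSTS[instruction]
    return (current_ma / 1000.0) * cycles * V_CC * TAU


for name in BASE_COSTS:
    print(f"{name:12s} {base_energy(name) * 1e9:6.2f} nJ")
```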






### Instruction level modeling (III)

#### Measurement conditions:

- The loop should be large enough to obtain a converged value, which minimises the impact of the branch at the end of the loop.
- But it must not be too large, in order to avoid cache misses.
- System effects such as time-sharing and interrupts are undesirable.

#### Variations in base costs:

- The table shows that instructions with differing functionality and different addressing modes can have very different costs. This is expected, since these instructions affect different functional blocks in different ways.
- Within the same family of instructions, base costs differ depending on the value of the operands. For example, the base cost of MOV instructions decreases as the number of 1's in the data increases.

| Data           | 0     | 0F    | 0FF   | 0FFF  | 0FFFF |
|----------------|-------|-------|-------|-------|-------|
| No. of 1's     | 0     | 4     | 8     | 12    | 16    |
| Base Cost (mA) | 309.5 | 305.2 | 300.1 | 294.2 | 288.5 |


### Instruction level modeling (IV)



#### So:

- As the previous table shows, the variation in base cost with immediate operand values is significant.
- Using different registers does not result in significant base cost differences.
- The range of variation shown by the ADD instruction is small: < 5%.
- In instructions involving memory operands, the variation in base cost depends on the address of the operand, specifically on the number of 1's in the operand address.

#### Inter-instruction effects:

- When sequences of instructions are considered, certain inter-instruction effects come into play.
  - Circuit state effect. When a pair of different instructions is considered, the change in circuit state between them is greater. The result is a circuit state overhead: the measured cost is always greater than the base cost of the pair. As an example, the measured cost of the sequence in the next table is 332.8 mA (average current over 10 cycles), whereas the base costs alone predict 326.8 mA.
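
The size of this overhead follows directly from the two quoted currents. The sketch below is a rough calculation, assuming the 3.3 V, 25 ns setup given earlier; the variable names are introduced here for illustration.

```python
V_CC = 3.3     # volts
TAU = 25e-9    # seconds per cycle

measured_ma = 332.8     # measured average current of the sequence
predicted_ma = 326.8    # average current predicted from base costs alone
cycles = 10             # length of the sequence in cycles

# Circuit state overhead, as a current and as energy over the sequence.
overhead_ma = measured_ma - predicted_ma
overhead_energy = (overhead_ma / 1000.0) * V_CC * TAU * cycles
relative = overhead_ma / predicted_ma

print(f"{overhead_ma:.1f} mA, {overhead_energy:.2e} J, {relative:.1%}")
```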








### Software power optimisation

- Energy savings are possible through software optimisation.
- Instruction reordering can reduce power consumption.
- The choice of instructions influences energy consumption. For example, instructions with memory operands draw a much higher average current than instructions with register operands.
  - Instructions using only register operands cost about 300 mA.
  - Memory reads cost upwards of 430 mA.
  - Memory writes cost upwards of 530 mA.
  - Instructions that take fewer clock cycles save energy. For example, ADD DX,[BX] takes two cycles, while ADD DX,BX takes just one.
- Potential pipeline stalls, misaligned accesses, and cache misses add to the running time.
- Although the number of memory operands can be reduced by adopting suitable code generation policies, the best way to avoid memory operands is through better use of registers.
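
To make the register-versus-memory comparison concrete, the sketch below compares the two ADD variants using the base costs listed earlier for the 486DX2 (400.1 mA over 2 cycles and 313.6 mA over 1 cycle). It assumes the 3.3 V, 25 ns setup; `instruction_energy` is a helper name introduced here.

```python
V_CC = 3.3     # volts
TAU = 25e-9    # seconds per cycle


def instruction_energy(current_ma: float, cycles: int) -> float:
    """Energy in joules of one instruction from its base current and cycle count."""
    return (current_ma / 1000.0) * cycles * V_CC * TAU


e_mem = instruction_energy(400.1, 2)   # ADD DX,[BX]
e_reg = instruction_energy(313.6, 1)   # ADD DX,BX
ratio = e_mem / e_reg

print(f"memory operand costs {ratio:.2f}x the register operand")
```

The ratio combines both effects named above: the higher average current and the extra cycle.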



Results of energy optimisation of sort and circle algorithms

| Program                    | hlcc.asm | hht1.asm | hht2.asm | hht3.asm |
|----------------------------|----------|----------|----------|----------|
| Avg. Current (mA)          | 525.7    | 534.2    | 507.6    | 486.6    |
| Execution Time (µs)        | 11.02    | 9.37     | 8.73     | 7.07     |
| Energy (10<sup>-6</sup> J) | 19.12    | 16.52    | 14.62    | 11.35    |

| Program                    | clcc.asm | cht1.asm | cht2.asm | cht3.asm |
|----------------------------|----------|----------|----------|----------|
| Avg. Current (mA)          | 530.2    | 527.9    | 516.3    | 514.8    |
| Execution Time (µs)        | 7.18     | 5.88     | 5.08     | 4.93     |
| Energy (10<sup>-6</sup> J) | 12.56    | 10.24    | 8.65     | 8.37     |

- hlcc.asm → assembly code generated by lcc, a general-purpose C compiler that produces good code.
- hht1.asm → hand tuning for shorter running time (a 15% reduction). Only temporary variables are allocated in registers.
- hht2.asm → 3 local variables are allocated in registers, and the appropriate memory operands are replaced by register operands.
- hht3.asm → 2 more local variables are allocated in registers, and all redundant instructions are removed.
- As a result, the energy consumption of the sort algorithm is reduced by 40.6%, and that of the circle algorithm by about 33%.
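
The reported reductions can be reproduced from the current and execution-time figures in the tables. The sketch below is a consistency check, assuming the 3.3 V supply of the measurement setup; the helper names are introduced here.

```python
V_CC = 3.3  # volts


def energy_uj(avg_current_ma: float, time_us: float) -> float:
    """Energy in microjoules: E = I * V_CC * T."""
    return (avg_current_ma / 1000.0) * V_CC * time_us


def reduction(before: float, after: float) -> float:
    """Fractional reduction from 'before' to 'after'."""
    return (before - after) / before


sort_before = energy_uj(525.7, 11.02)    # hlcc.asm
sort_after = energy_uj(486.6, 7.07)      # hht3.asm
circle_before = energy_uj(530.2, 7.18)   # clcc.asm
circle_after = energy_uj(514.8, 4.93)    # cht3.asm

print(f"sort:   {reduction(sort_before, sort_after):.1%}")
print(f"circle: {reduction(circle_before, circle_after):.1%}")
```

Most of the saving comes from the shorter execution time; the average current changes by only a few percent.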


## Power optimisation on the MC68HC908GP32 Microcontroller



#### The MC68HC908GP32 microcontroller

- It is a current 8-bit microcontroller with the low-cost, high-performance attributes characteristic of the M68HC08 family. The internal bus frequency is 8 MHz.
- The internal registers are an 8-bit accumulator, a 16-bit index register (which also acts as a general-purpose register), the 16-bit program counter, and an 8-bit condition code register. There is also a two-byte stack pointer register.
- The whole address space comprises 64 KB, divided into different functional regions.
- The stack addressing mode compensates for the lack of internal registers.
- It has five 8-bit port sets (four of them dual-function). There are 20 different modules, each dedicated to a given task, not counting the CPU and the computer operating properly (COP) module.
- The microcontroller has a low-voltage inhibit module, which monitors the V<sub>DD</sub> value and forces a reset if it falls below a critical voltage.
- It can operate from a 3 V or 5 V power supply. The maximum transient current reaches 100 mA, although in steady state the current is at most some tens of milliamperes (with maximum values of 10-15 mA at the port pins).
- It is especially suitable for low-power applications. It has two idle operating modes, named wait and stop, both characterised by reduced power consumption. In wait mode only the CPU clock is disabled, whereas in stop mode almost all modules are shut down, lowering the power consumption even further.












## Instruction and sequence alterations (I)

The series of improvements in terms of energy savings can be complemented with one-off steps, gathered in the present subsection:

- 1. Just as it is better to address (page zero) memory than to access the stack (subsection *B*), the instructions TAX / TXA (transfers between accumulator and index register) are preferable to their stack counterparts (TSX, TXS).
- 2. Whenever possible, avoid *division* (DIV) and *multiplication* (MUL), for example when the multiplier or divisor is a power of 2 (the magnitude of the exponent must be taken into account when deciding which of the two alternatives gives better results), or when the same operation can be performed as a succession of shifts and additions or, respectively, subtractions.
- 3. Interrupt processing consumes more time, and hence energy, because it is necessary to push onto the stack not only the program counter, as happens when a routine is called, but also the index register, the condition code register, and the accumulator, and furthermore to reset the interrupt flag. That is why it is better to avoid SWI instructions in favour of JSR (routine calls). The same discussion applies to the corresponding return instructions. The savings amount to 4 and 3 cycles, respectively.
- 4. A similar justification can be given for a simple jump (branch) compared with a jump to subroutine. An ordinary jump takes two cycles fewer than a routine call with the same addressing mode. To this one should add the accompanying return instruction: 4 cycles, against an average of 3 for a jump back.
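
The replacement of MUL and DIV by shifts (item 2) rests on simple arithmetic identities, sketched below in Python for non-negative integers. This only illustrates the equivalence; it does not model the 8-bit accumulator width or the cycle counts of the microcontroller, and the function names are introduced here.

```python
def mul_pow2(x: int, k: int) -> int:
    """Multiply a non-negative integer by 2**k using a left shift."""
    return x << k


def div_pow2(x: int, k: int) -> int:
    """Divide a non-negative integer by 2**k (truncating) using a right shift."""
    return x >> k


def mul_by_10(x: int) -> int:
    """Multiply by 10 as a sum of shifts: 10*x = 8*x + 2*x."""
    return (x << 3) + (x << 1)


print(mul_pow2(13, 3), div_pow2(200, 2), mul_by_10(25))
```

Whether a shift sequence is cheaper than a single MUL or DIV depends on how many shifts are needed, which is the caveat about the exponent in item 2.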


### Instruction and sequence alterations (II)

- 5. A single instruction is sometimes more beneficial than a pair that produces the same result (it also shortens the program). For example, when moving data from one memory location to another, it is advisable to use a single move (MOV) instead of a sequence that loads the data into the index register and stores it from there (LDX, STX).
- 6. If the semantics of the program are known, for example the data range with which it most often runs, it is possible to change the branch conditions so that taken jumps become less frequent.
- 7. Any independent operation should be taken out of branches and loops, since otherwise it counts for both branches or for every iteration, respectively.
- 8. When the number of iterations of a loop is known in advance and is sufficiently small, it is better to replace the loop with its body replicated the same number of times.
- 9. Other improvements involve disentangling nested calls and merging calls with the same number of iterations.
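
Items 7 and 8 correspond to the transformations usually called loop-invariant code motion and loop unrolling. The sketch below shows both in Python, purely to illustrate the code shape; the functions are hypothetical examples, and the energy effect on the microcontroller would have to be measured on the assembly code.

```python
def scaled_sum_naive(values, a, b):
    total = 0
    for v in values:
        factor = a * b          # independent of v, yet recomputed every iteration
        total += v * factor
    return total


def scaled_sum_hoisted(values, a, b):
    factor = a * b              # item 7: computed once, outside the loop
    total = 0
    for v in values:
        total += v * factor
    return total


def sum4_loop(v):
    total = 0
    for i in range(4):          # iteration count known in advance and small
        total += v[i]
    return total


def sum4_unrolled(v):
    # item 8: the loop is replaced by its body replicated four times
    return v[0] + v[1] + v[2] + v[3]


data = [3, 1, 4, 1]
print(scaled_sum_hoisted(data, 2, 5), sum4_unrolled(data))
```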




### Minimisation techniques in DSP processor

- Lee, Tiwari, ... analyse a power model at the instruction level in an embedded DSP. Some significant points are:
  - Circuit state changes cause greater power consumption in the DSP. This means that appropriate scheduling of instructions can reduce the power cost of programs.
  - The study shows that faster programs consume less energy. Scheduling for power minimisation is explored.
  - Special architectural features of the DSP processor, provided to reduce the number of cycles for programs, are also very effective in reducing the energy cost:
    - Double data transfers from different memory banks to registers in one cycle.
    - Packing of two instructions into a single code-word.
  - Also, the on-chip Booth multiplier is a major source of energy consumption for DSP programs. An effective technique for local code modification by operand swapping is proposed to reduce power consumption.
  - The energy minimisation methodology, applied to a given piece of code on a Fujitsu (3.3 V, 0.5 µm, 40 MHz, CMOS) embedded DSP processor, shows energy reductions ranging from 26% to 73%.
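
One way to picture operand swapping is to route the operand that needs fewer add/subtract steps to the recoded input of the multiplier. The sketch below is only an illustration under stated assumptions: it assumes radix-2 Booth recoding and uses the number of non-zero recoded digits as a stand-in for multiplier activity. The slides do not give the actual metric or the multiplier architecture, and both function names are hypothetical.

```python
def booth_nonzero_digits(value: int, bits: int = 16) -> int:
    """Count non-zero digits in the radix-2 Booth recoding of a two's-complement value."""
    word = value & ((1 << bits) - 1)
    prev = 0                    # implicit bit to the right of the LSB
    count = 0
    for i in range(bits):
        bit = (word >> i) & 1
        if bit != prev:         # a 0/1 transition yields a +1 or -1 digit
            count += 1
        prev = bit
    return count


def order_operands(a: int, b: int, bits: int = 16):
    """Return (multiplicand, multiplier), placing the lower-weight operand on the recoded input."""
    if booth_nonzero_digits(a, bits) < booth_nonzero_digits(b, bits):
        return b, a
    return a, b


print(order_operands(0x00FF, 0x5555))
```

Because multiplication is commutative, the swap leaves the product unchanged; only the activity inside the multiplier is assumed to differ.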













### Low power in Intel® 855GM Chipset (II)

#### Power management architectures

- Main Memory Power Management. Main memory is power managed in normal and low-power advanced configuration modes.
  - Based on idle conditions in a given row of memory, that memory row may be powered down. If the pages of a row have all been closed at the time of power down, the device will enter a precharge power-down state. Otherwise, if pages remain open, the device will enter an active power-down state.
- Graphics Memory Controller Hub Dynamic IO/DLL Power Management.
  - Use of memory address tri-states when all memory is powered down or in self-refresh, memory clock tri-states for unpopulated DIMMs, and disable control for control sense amps and data bus sense amps.
  - Use of DLLs (delay-locked loops) for adjusting input signals to data strobe signals. DLLs are designed with master and slave parts. The master calibrates the delay elements to tune the entire delay line. The slave is the actual delay line used to delay a functional signal. DLLs are disabled when possible.
- Global design criteria. Global design criteria have been followed to adjust the parameters:
  - Validation methodology, from modeling hardware for logic simulation to BIOS validation in a pre-silicon environment.
  - Modeling components for logic simulations.
  - Revision of circuit-level components.
  - Hardware emulation and BIOS validation.



