Loop Optimization

Code in loops may be executed many times over and therefore merits thorough optimization, because speed gains there are particularly noticeable. If a loop is iterated 1,000 times and optimization shortens the runtime of the loop body by one thousandth of a second, total program runtime is one second shorter. One second may not seem like much. However, the benefits are best assessed by considering very time-critical kernel actions such as task switching, or long-running programs such as physical simulations. While execution times in the latter case may differ by hours or even days, savings of fractions of a second are a desirable goal in the former case; after all, task switches are performed at short intervals (beyond human perception) to give the illusion of parallel program execution.

This optimization is not difficult to understand and, in technical terms, is relatively simple. It can be illustrated with the following short sample program:

int count;

for (count = 0; count < 3; count++) {
    printf("Pass: %d\n", count);
}

Note that compilation naturally takes longer when results are computed at compilation time. This is a common aspect of all optimization efforts: execution speed is boosted at the expense of compilation speed. However, this is acceptable because compilation is performed once only, whereas execution takes place regularly.

The loop is iterated three times in succession and outputs the number of the current pass (0, 1, or 2). What can be optimized here? The loop is implemented in assembler code by incrementing the count status variable at the end of the loop body and then checking its value. If it is less than 3, the loop is restarted (a branch is made back to the beginning of the loop body); otherwise, the code following the loop is executed. Without optimization, the generated assembler code looks like this:

	.file	"loop.c"
	.section	.rodata
.LC0:
	.string	"Pass: %d\n"
	.text
	.globl	main
	.type	main,@function
main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$24, %esp
	andl	$-16, %esp
	movl	$0, %eax
	subl	%eax, %esp
	movl	$0, -4(%ebp)
.L2:
	cmpl	$2, -4(%ebp)
	jle	.L5
	jmp	.L3
.L5:
	subl	$8, %esp
	pushl	-4(%ebp)
	pushl	$.LC0
	call	printf
	addl	$16, %esp
	leal	-4(%ebp), %eax
	incl	(%eax)
	jmp	.L2
.L3:
	leave
	ret
.Lfe1:
	.size	main,.Lfe1-main

If the number of loop passes is small, the code often executes more quickly if the assembler code of the loop body is written to the output file several times in succession. There is then no need to compare the status variable with the end value, and the conditional branch can be dispensed with. With loop unrolling enabled, GCC generates the following code:

	.file	"loop.c"
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"Pass: %d\n"
	.text
	.p2align 4,,15
	.globl	main
	.type	main,@function
main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	andl	$-16, %esp
	movl	$0, 4(%esp)
	movl	$.LC0, (%esp)
	call	printf
	movl	$1, 4(%esp)
	movl	$.LC0, (%esp)
	call	printf
	movl	$2, 4(%esp)
	movl	$.LC0, (%esp)
	call	printf
	movl	%ebp, %esp
	popl	%ebp
	ret
.Lfe1:
	.size	main,.Lfe1-main
	.ident	"GCC: (GNU) 3.2.1"

It should be noted that this method can drastically increase the size of the generated program code. The real technical difficulty is deciding whether to apply the optimization at all: the number of loop passes up to which unrolling is an improvement depends not only on the code in the loop body, but also on the processor type. Defining corresponding heuristics in the compiler is therefore difficult, even though the result they produce is easy to understand.
