C11 From Source Code to Machine Program

The work of compilers can be divided into several phases, as the following overview demonstrates:

□ Preprocessing — All pre-processor actions are performed in this phase. Depending on the compiler version, this phase is supported by an external utility (cpp) or by special library functions — both are initiated automatically by the compiler. On completion of preprocessing, there is only one (large) input file generated from the source file and all header files are included using the #include directive. The compiler itself is then no longer required to take account of the distributed structure of C programs over several files.

□ Scanning and Parsing — The syntax of a programming language can be described by means of grammatical rules that are similar to those of a natural language such as English but that must understandably be much more restrictive. (Although the existence of several alternatives to represent one and the same fact contribute greatly to the appeal and subtlety of a language, ambiguity must be avoided at all costs in programming languages.) This phase usually comprises two closely linked tasks. The scanner analyzes the source text character-by-character and looks for keywords of the programming language. The parser takes the input stream supplied by the scanner and already abstracted from source text representation and checks that the structures it detects are correct in terms of the grammar rules of the language. It also creates data structures in computer memory that are a further abstraction of the source code and are designed for processing by a computer (in contrast to the actual source code of the program that should be as easy as possible to read and manipulate by human beings).

□ Intermediate Code Generation — A further step along the path toward the final machine code converts the parse tree (i.e., the data structure created in memory) set up by the scanner and parser into another language known as the register transfer language (RTL). This is a kind of assembly language for a hypothetical machine. This language can be optimized — independently of the target processor for the most part. However, this does not mean that the RTL code generated in this phase of the compilation process is the same for all target processors. Depending on the architecture, a range of assembler statements are available — and this fact must be taken into account during RTL generation.

The individual statements of the RTL are already on a very low level and are a step away from the high-level C language on the path to the assembly language. Their main task is to manipulate register values to support execution of the compiled program. There are, of course, also conditional statements and other mechanisms to control program flow. However, this intermediate code still includes various elements and structures common to higher-level programming languages (these are not specific to a particular language such as C, Pascal, etc.) that do not appear in a pure assembly language.

□ Optimization — The most compute-intensive phase of program compilation is optimization of the intermediate code in the RTL language. The reasons why programs are optimized are clear. But how is this done by the compiler? Because the mechanisms used are generally not only complex but also sophisticated and even devious (subtle details must always be taken into account), it would not be difficult to write a long tome on optimization techniques alone, and a further one on their usage in the GCC. Nevertheless, this appendix illustrates at least some of the techniques employed. All optimization options are based on ideas that initially appear to be relatively simple. However, in practice (and in theory) they are difficult to implement. Such options include, above all, the simplification of arithmetic expressions (algebraic rewriting of terms into expressions that can be computed more efficiently and/or with a less-intensive use of memory), elimination of dead code (parts of code that cannot be reached by the program flow), merging of repeated expressions and items of code in a program, rewriting of program flow into a more efficient form, and so on — these are covered as individual topics in this appendix.

□ Code generation — The last phase is concerned exclusively with the generation of the actual assembler code for the target processor. However, this does not yet produce an executable binary file, but instead, it produces a text file with assembler instructions that is converted into binary machine code by further external programs (assemblers and possibly linkers). In principle, the assembler code has the same form as the code of the final program but can still be read by humans (not by machines) even if the power of the individual commands has reached machine level.

To provide you with a general overview of the various compiler steps involved, this appendix uses a classical example — the ''Hello, World'' program.

#include<stdio.h>

The program does nothing more than output the line Hello, World! and is typically the first program discussed in any C textbook. On IA-32 systems, the following assembler code is generated for further processing by the assembler and linker:

.file .section

.string .text .globl main

.type main:

pushl movl subl andl movl subl movl call movl leave ret

.size .ident hello.c"

.rodata

"Hello, World!\n"

main,@function %ebp

%esp, %ebp $8, %esp $-16, %esp $0, %eax %eax, %esp $.LC0, (%esp) printf $0, %eax main,.Lfe1-main "GCC: (GNU) 3.2.1"

If you are already familiar with assembler programming, you may be surprised by the somewhat strange form of the code. The GNU assembler employs the AT&T syntax instead of the more-widespread and therefore better-known Intel/Microsoft variant. Of course, both alternatives implement the same functionality but use different arrangements of source and destination registers and different forms of constant addressing. Section C.1.7 provides a brief description of these syntax elements.

The exact meaning of the individual assembler commands is of no concern here, because it is beyond the scope of this appendix to provide a full introduction to assembler programming. Indeed, a separate book would be needed for each architecture supported by the kernel. Of more importance is the structure of the code generated. Constant strings are held in a separate section from which they are loaded when they are passed to a function (printf in this case) or when they are generally needed. In assembler code, functions (only main is used here) retain the same name as in C code.

The same initial code generates totally different assembler code on IA-64 systems (because the architecture is completely different), but the ultimate effect is identical to that of the code generated on IA-32 systems.

.file "hello.c"

.pred.safe_across_calls p1-p5,p16-p63 .section .rodata

.align 8

stringz "Hello, World!\n"

.text

.align 16

.global main#

.proc main#

main:

.prologue 14, 33

.save ar.pfs, r34

.vframe r35

.body addl r14 = @ltoff(.LC0), gp ld8 r3 6 = [r14] mov r32 = r1

br.call.sptk.many b0 = printf#

mov r8 = r14 mov ar.pfs = r34 mov b0 = r33 .restore sp mov r12 = r3 5 br.ret.sptk.many b0

.endp main#

To give you an example of how this is handled by a non-Intel architecture, the following is the code generated by the ARM variant:

.file "hello.c" .section .rodata

.align 2

.text

.align 2

.global main

.type main,function

@ args = 0, pretend = 0, frame = 0 @ frame_needed = 1, uses_anonymous_args = 0 mov ip, sp stmfd sp!, {fp, ip, lr, pc}

bl printf mov r3, #0

mov r0, r3

.align 2

.word .LC0

.size main,.Lfe1-main .ident "GCC: (GNU) 3.2.1"

How does the GCC obtain information on the capabilities and command options of the target processor? The answer is that a machine description is present for each target architecture supported. This consists of two parts and provides the desired information.

First, there is a file with instruction patterns whose structure is a mixture of LISP and RTL syntax.2 Some parts of this pattern can be supplied with values by the compiler when RTL code is generated. Restrictions can be placed on the possible values by defining various conditions or other prerequisites. Generation of the actual code is performed by the output patterns that represent the possible assembler instructions and are associated with the instruction patterns. The source files that hold the instruction patterns for the individual systems are a very important part of the compiler and are therefore correspondingly large. The statement list for IA-32 processors is about 14,000 lines long; the Alpha variant is 6,000 lines long; and approximately 10,000 lines are needed for the Sparc family.

The instruction patterns are supplemented by C header files and macro definitions in which processor-specific special situations that do not fit into the instruction patterns can be handled.3 It is necessary to use C code, even if the target instruction cannot be implemented with a fixed string or simple macro substitutions.

2LISP is a programming language whose origins lie in artificial intelligence. It is often used as a dynamic extension language for application programs. Large parts of emacs are programmed in a LISP dialect, and the GIMP image manipulation program uses the Scheme extension language (a simplified LISP variant). GUILE, a library developed by the GNU project, features simple options for providing programs with a Scheme interpreter as an extension language.

3 An explicit goal when developing the GCC was to rank performance higher than theoretic elegance. It would be possible to describe processors solely with the help of the instruction pattern files, but this would entail a loss of performance and flexibility. The additional macro definitions do not contribute to the elegance of the overall system, but are a useful enhancement to help accomodate the special features of the individual CPU types.

The size of the additional macro and C files is similar to that of the actual instruction patterns (IA-32: 12,000 lines; Alpha: 9,000 lines; and Sparc: 12,000 lines). They constitute an important part of the CPU definition and are essential for the generation of efficient code.

Continue reading here: C12 Assembly and Linking

Was this article helpful?

0 0