A senior Linux/C++ programmer needs to understand how source code is transformed into an executable file. The process has four stages: preprocessing, compilation, assembly, and linking.


Source Code Files

Source code files are where it all begins. They contain human-readable instructions written in C or C++. C source files typically have a .c extension, while C++ source files usually have a .cpp extension (.cc and .cxx are also common). For example, consider the following simple C++ source file:

// main.cpp
#include <iostream>
#define PI 3.1415926

int main() {
    std::cout << "PI value is " << PI << std::endl;
    return 0;
}

Preprocessing

Preprocessing is the first stage of the build. The preprocessor handles the directives that begin with #, such as #include, #define, and #ifdef: it pastes in the contents of included headers, expands macros, and keeps or drops conditionally compiled regions. The result is an expanded source file (a translation unit), conventionally given a .i extension for C and .ii for C++ by GCC. With GCC or Clang, the -E option stops after this stage. For our example, preprocessing replaces PI with its defined value and pastes in the contents of iostream, which in practice amounts to many thousands of lines and is abbreviated to a single comment here:

// main.i
// Expanded content of iostream
int main() {
    std::cout << "PI value is " << 3.1415926 << std::endl;
    return 0;
}
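
The substitution described above can be observed from inside a program. The following is a minimal sketch, not part of the original example: the helper macros STR and XSTR, the USE_LONG_GREETING switch, and the function names are assumptions introduced only for illustration. It relies on the standard stringification operator (#) to show the text that the preprocessor produces for PI, and on #ifdef to show that only one branch survives preprocessing.

```cpp
#include <string>

#define PI 3.1415926

// Two-level stringification: XSTR(PI) first expands PI to 3.1415926,
// then STR turns the resulting tokens into a string literal.
#define STR(x) #x
#define XSTR(x) STR(x)

// Hypothetical feature switch, defined here only for this illustration.
#define USE_LONG_GREETING

std::string pi_as_text() {
    return XSTR(PI);   // after preprocessing: return "3.1415926";
}

std::string pi_name() {
    return STR(PI);    // operands of '#' are not expanded: return "PI";
}

std::string greeting() {
#ifdef USE_LONG_GREETING
    return "PI value is ";   // this branch is kept
#else
    return "PI=";            // this branch never reaches the compiler
#endif
}
```

Calling pi_as_text() yields the text "3.1415926", which is exactly what the compiler sees in place of PI; the compiler itself never encounters the name PI.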

Compilation

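In the compilation stage, the compiler translates the preprocessed source code into assembly language for the target platform. This step involves syntax and semantic analysis, as well as optimization. The compiler outputs assembly code, usually with a .s extension (with GCC or Clang, the -S option stops after this stage). The real output for the iostream-based example is long, so the listing below is a simplified, hand-written illustration for x86-64 Linux in AT&T syntax, corresponding to a version of the program that prints the value with printf: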

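    .section .rodata
.LC0:
    .string "PI value is %f\n"
.LC1:
    .double 3.1415926              # PI stored as a double constant
    .text
    .globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    movsd .LC1(%rip), %xmm0        # floating-point argument: PI
    leaq .LC0(%rip), %rdi          # first integer argument: format string
    movl $1, %eax                  # variadic call: one vector register in use
    call printf@PLT
    movl $0, %eax                  # return 0
    popq %rbp
    ret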
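
Because the listing above calls printf rather than the iostream operators, it is closer to the following variant of the program than to main.cpp itself. This is a sketch under that assumption; the function name print_pi is hypothetical and exists only so the call can be exercised in isolation.

```cpp
#include <cstdio>

#define PI 3.1415926

// Prints the same message as main.cpp, but through printf instead of
// std::cout. Returns printf's result: the number of characters written.
int print_pi() {
    // PI is a double literal; %f prints it with six digits after the point.
    return std::printf("PI value is %f\n", PI);
}
```

The call passes a format string and one double, which is why the assembly loads a string address into an integer register and the constant into a floating-point register before the call instruction.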

Assembly

The assembly stage converts assembly code into machine code: the binary instructions that the CPU executes. The assembler produces an object file with a .o extension (on Unix/Linux) or a .obj extension (on Windows). Object files contain machine code, but the addresses of many symbols in them are not yet final. They are binary files and are generally not human-readable.

Why doesn’t the assembler complete address binding?

  1. Multi-Module Programs: Large programs are often divided into multiple source files, each compiled into a separate object file. The assembler sees one module at a time, so it can only assign section-relative offsets to symbols defined in the current module and must record symbols defined elsewhere as unresolved.

  2. Library Linking: Programs may use external libraries. These libraries are compiled separately, and their function addresses are not known during assembly. The assembler generates relocation records for these symbols, which are resolved in the linking stage.
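
The two cases above can be seen side by side in a small translation unit. This is an illustrative sketch; the function names twice and parse_and_double are hypothetical. One function is defined locally, while std::atoi is only declared by its header and defined in the C library.

```cpp
#include <cstdlib>

// Defined in this translation unit, so the assembler knows its offset
// inside this object file's code section.
int twice(int x) {
    return 2 * x;
}

// std::atoi is only declared by <cstdlib>; its machine code lives in the
// C library. The object file for this code typically records it as an
// undefined symbol, with a relocation entry at the call site.
int parse_and_double(const char* text) {
    return twice(std::atoi(text));
}
```

If this were compiled to an object file on a system with GNU binutils, running nm on it would be expected to list atoi as undefined (U) and the two functions above as defined in the text section (T), under their mangled C++ names.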

Relocation

Object files carry two kinds of bookkeeping for unresolved symbols: a symbol table and relocation records.

  • Symbol Table: This table lists the symbols that the object file defines and the symbols that it references but does not define.

  • Relocation Records: These records mark the places in the machine code where an address is still to be filled in. References that the assembler can resolve on its own, such as a relative jump to a label in the same section, are filled in directly; for everything else, including every external symbol, the assembler emits a relocation entry.

  • Linking: The linker takes multiple object files and library files, combines them, and resolves all symbol references. It builds a global symbol table, assigns final addresses to the code and data sections, and patches each relocation site with the resolved address.
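
The patching step can be sketched with a toy model. Everything here is an assumption made for illustration: ToyObject, Relocation, and apply_relocations are hypothetical names, the record layout is far simpler than a real object format such as ELF, and absolute 32-bit addresses are stored where a real x86-64 call would usually hold a PC-relative displacement.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy relocation record: "the 4 bytes at 'offset' must hold the address
// of 'symbol'".
struct Relocation {
    std::size_t offset;
    std::string symbol;
};

// Toy object file: raw bytes with zeroed placeholders, plus relocations.
struct ToyObject {
    std::vector<std::uint8_t> bytes;
    std::vector<Relocation> relocations;
};

// Patch every placeholder with the address found in the global symbol
// table. Throws if a symbol is missing, mirroring an "undefined
// reference" error at link time.
void apply_relocations(ToyObject& obj,
                       const std::map<std::string, std::uint32_t>& symbols) {
    for (const Relocation& r : obj.relocations) {
        auto it = symbols.find(r.symbol);
        if (it == symbols.end()) {
            throw std::runtime_error("undefined reference to " + r.symbol);
        }
        if (r.offset + 4 > obj.bytes.size()) {
            throw std::out_of_range("relocation outside object");
        }
        // Store the 32-bit address in little-endian byte order.
        for (int i = 0; i < 4; ++i) {
            obj.bytes[r.offset + i] =
                static_cast<std::uint8_t>((it->second >> (8 * i)) & 0xFF);
        }
    }
}
```

The design mirrors the division of labor in the text: the assembler's side produces the bytes and the list of records, and the linker's side supplies the symbol table and performs the patching.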


Linking

In the final stage, the linker processes one or more object files and libraries and resolves the symbol references between them. It combines the object files with the library code they need to produce the final executable file (with a .exe extension on Windows, or conventionally no extension on Unix/Linux). The linker's output is a program file that the operating system can load and run.
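
A reference between two modules might look like the sketch below. The file names main.cpp and geometry.cpp and both function names are hypothetical. The two parts are shown in one listing so that it compiles as a single file; split into two files, each would become its own object file, and the linker would connect the call to the definition.

```cpp
// ---- main.cpp (hypothetical file) ----
// Declaration only: enough for the compiler to type-check the call below.
double circle_area(double radius);

double unit_circle_area() {
    return circle_area(1.0);   // left as an unresolved reference until link time
}

// ---- geometry.cpp (hypothetical file) ----
// The definition that the linker matches the reference to.
double circle_area(double radius) {
    return 3.1415926 * radius * radius;
}
```

If the object file containing the definition were left out of the link, the first file would still compile and assemble without complaint, and the failure would appear only at link time, as an error such as GNU ld's "undefined reference".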


Conclusion

Understanding the compilation process from source code to executable is fundamental for mastering C/C++ programming and debugging. Each of the four stages (preprocessing, compilation, assembly, and linking) plays a distinct role in converting human-readable code into a format that the computer can execute.