For a senior Linux/C++ programmer, it is crucial to understand how source code is transformed into an executable file. This process involves several stages: preprocessing, compilation, assembly, and linking.
Source Code Files
Source code files are where it all begins. They contain human-readable instructions written in C or C++. C source files typically have a .c extension, while C++ source files usually have a .cpp extension. The running example in this article is a simple C++ source file that includes iostream and defines a macro named PI.
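A minimal sketch of such a file, assuming only what the later sections mention: an #include of iostream and a macro named PI. The file name, the value 3.14159, and the functions circle_area and print_area are hypothetical choices made for illustration.

```cpp
// circle.cpp (hypothetical file name)
#include <iostream>   // directive: brings in the standard I/O declarations

#define PI 3.14159    // directive: object-like macro; the value is an assumed example

// Area of a circle; each PI below is replaced by the preprocessor.
double circle_area(double radius) {
    return PI * radius * radius;
}

// Prints the area so that the iostream include is actually used.
void print_area(double radius) {
    std::cout << "Area: " << circle_area(radius) << '\n';
}
```

Adding an int main() that calls print_area turns this into a complete program that can be taken through all four stages.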
Preprocessing
The preprocessing stage is the first step in the compilation process. During this phase, the preprocessor handles the directives in the source code that begin with #, such as #include, #define, and #ifdef. The result of preprocessing is an “expanded source code” file, typically with a .i extension (GCC uses .ii for preprocessed C++). This file contains all macro expansions and header file inclusions. For our example, preprocessing would replace PI with its defined value and insert the contents of iostream.
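One way to observe macro expansion from inside C++ is the two-level stringification idiom, which turns the expanded text of a macro into a string literal. The sketch below assumes a hypothetical value of 3.14159 for PI; the helper macros STR and XSTR and the function names are illustrative. With GCC or Clang, the -E option (for example, g++ -E file.cpp) prints the complete preprocessed output instead.

```cpp
#include <string>

#define PI 3.14159       // assumed example value
#define STR(x) #x        // stringifies its argument without expanding it
#define XSTR(x) STR(x)   // expands the argument first, then stringifies it

// The text that the preprocessor substitutes for PI.
std::string expanded_pi() { return XSTR(PI); }

// Without the extra level, the macro name itself is stringified.
std::string unexpanded_pi() { return STR(PI); }

// #ifdef is also resolved in this stage: only one branch reaches the compiler.
std::string pi_status() {
#ifdef PI
    return "PI is defined";
#else
    return "PI is not defined";
#endif
}
```

Here expanded_pi() yields the replacement text of the macro, while unexpanded_pi() yields the two characters of its name, which shows that the substitution is purely textual.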
Compilation
In the compilation stage, the compiler translates the preprocessed source code into assembly language specific to the target platform. This step involves syntax and semantic analysis, as well as optimization. The compiler outputs assembly code for that platform (x86 assembly on an x86 machine, for example), usually in a file with a .s extension.
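To look at this stage in isolation, the build can be stopped after compilation, for example with the -S option of GCC or Clang (g++ -S file.cpp), which writes the generated assembly to a .s file. The sketch below is a hypothetical input for such an experiment; the function names are illustrative, and the exact assembly depends on the compiler, target, and optimization level, so no output is shown.

```cpp
// Hypothetical input for inspecting compiler output.
// constexpr allows the compiler to evaluate calls with constant arguments
// during compilation.
constexpr int scale_and_add(int a, int b) {
    return 2 * a + b;
}

// Semantic analysis happens in this stage: the check below is evaluated by
// the compiler, and a false condition would stop the build before any
// assembly is written.
static_assert(scale_and_add(3, 4) == 10, "evaluated at compile time");

// A call with non-constant arguments; with optimization enabled the
// compiler may inline it.
int scale_and_add_runtime(int a, int b) {
    return scale_and_add(a, b);
}
```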
Assembly
The assembly stage converts assembly code into machine code, which consists of binary instructions that the CPU can execute directly. The assembler produces an object file with a .o (on Unix/Linux) or .obj (on Windows) extension. These object files contain the machine code but do not yet have finalized addresses. They are binary files and are generally not human-readable.
Why doesn’t the assembler complete address binding?
- Multi-Module Programs: Large programs are often divided into multiple source files, each compiled into a separate object file. During assembly, the assembler can only handle addresses for symbols within the current module and marks external symbols as unresolved.
- Library Linking: Programs may use external libraries. These libraries are compiled separately, and their function addresses are not known during assembly. The assembler generates relocation records for these symbols, which are resolved in the linking stage.
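The two reasons above can be illustrated with a sketch that imitates two modules in a single listing. The names compute and helper and the file names in the comments are hypothetical. In a real project the part marked module B would live in its own source file, so that module A is translated with nothing but the declaration of helper in view.

```cpp
// ---- module A (imagine: a.cpp) ----
// Declaration only: it gives the compiler the signature, not the address.
int helper(int x);

// When a.cpp is assembled on its own, the call to helper() cannot be given
// a final address; the object file records helper as an undefined symbol.
int compute(int x) {
    return helper(x) + 1;
}

// ---- module B (imagine: b.cpp) ----
// The definition that the linker later matches to the reference in module A.
int helper(int x) {
    return 2 * x;
}
```

With the two parts in separate files, compiling each with the -c option of GCC or Clang produces two object files, and neither one alone is enough to produce a working program.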
Relocation
The object files include relocation records and symbol tables to handle unresolved symbols:
- Relocation Records: These records mark the places in the machine code where an address still has to be filled in. References to symbols defined in the same module can often be resolved by the assembler (for example, as offsets relative to the current instruction), although their final absolute addresses are still assigned by the linker; for external symbols, the assembler creates relocation entries.
- Linking: The linker takes multiple object files and library files, combines them, and resolves all symbol references. It constructs a global symbol table and replaces unresolved addresses with actual addresses in the final executable. The linker also handles the final placement of code and data segments into the executable’s memory space.
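The interplay of symbol tables and relocation records can be sketched with a toy model. Everything below is hypothetical and heavily simplified: the types Relocation and ObjectFile, the function link_objects, and the idea of one int per code slot are assumptions made for illustration, and real object formats such as ELF or COFF are far richer.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model only, not the layout of any real object format.
struct Relocation {
    std::size_t offset;   // index in `code` of the placeholder to patch
    std::string symbol;   // name of the symbol whose address belongs there
};

struct ObjectFile {
    std::vector<int> code;                       // one int per code slot
    std::map<std::string, std::size_t> defined;  // symbol -> offset in this module
    std::vector<Relocation> relocations;         // references still unresolved
};

// Concatenates the modules, builds a global symbol table, then patches every
// relocation with the symbol's final position in the combined image.
std::vector<int> link_objects(const std::vector<ObjectFile>& objects) {
    std::vector<int> image;
    std::map<std::string, std::size_t> global_symbols;
    std::vector<std::size_t> bases;  // start of each module inside the image

    // Pass 1: place each module and record the final address of its symbols.
    for (const ObjectFile& obj : objects) {
        const std::size_t base = image.size();
        bases.push_back(base);
        for (const auto& entry : obj.defined) {
            if (!global_symbols.emplace(entry.first, base + entry.second).second) {
                throw std::runtime_error("duplicate symbol: " + entry.first);
            }
        }
        image.insert(image.end(), obj.code.begin(), obj.code.end());
    }

    // Pass 2: resolve each relocation against the global symbol table.
    for (std::size_t i = 0; i < objects.size(); ++i) {
        for (const Relocation& rel : objects[i].relocations) {
            const auto found = global_symbols.find(rel.symbol);
            if (found == global_symbols.end()) {
                throw std::runtime_error("undefined reference to " + rel.symbol);
            }
            image.at(bases[i] + rel.offset) = static_cast<int>(found->second);
        }
    }
    return image;
}
```

In this model, linking a three-slot module whose slot 1 refers to helper with a second module that defines helper at offset 0 patches that slot with 3, the position of helper in the combined image. A symbol that no module defines produces an error, much like the "undefined reference" messages of a real linker.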
Linking
The final stage, linking, involves the linker processing one or more object files and libraries to resolve external symbol references. It combines all object files and necessary libraries to produce the final executable file (with a .exe extension on Windows or no extension on Unix/Linux). The linker’s output is a program file that can be executed directly on a computer.
Conclusion
Understanding the compilation process from source code to executable is fundamental for mastering C/C++ programming and debugging. Each step in this process—preprocessing, compilation, assembly, and linking—plays a crucial role in converting human-readable code into a format that the computer can execute.