Abstract Machine Model
The C abstract model is a conceptual model that describes how a C program is executed. It consists of a set of rules and assumptions that define the behavior of C programs without specifying how they should be implemented on a particular computer or operating system, or concerning itself with optimization or performance.
Understanding the abstract model matters because of the “as-if” rule: the compiler may optimize code as long as the end result of the program’s execution appears “as if” the program had executed according to the original source code. This means the compiler can rearrange, eliminate, and otherwise transform parts of a C program as long as its observable outcome stays the same. Programmers without a firm understanding of the standards as applied through the as-if rule often rely on certain side effects, undefined behavior, and implementation details that lead to unexpected behavior when the same program is compiled or executed in another environment.
According to the abstract model, accessing a volatile object, modifying an object or file, or calling a function that does any of these, is called a side effect because it changes the state of the execution environment. Execution is divided by sequence points, each of which marks a stable, settled state where all prior side effects have completed and no later side effects have yet occurred. Between two sequence points, the order in which side effects occur is indeterminate.
From the external perspective, the main requirements of a program’s execution are that:
At each sequence point, volatile objects are stable and any accesses to such objects have completely occurred, or have not yet begun.
At program termination, all data written to files must be identical to what would be expected if the program executed exactly as described in the C standard. For example,
The dynamic input and output characteristics of the system are as described in the standard, such that file I/O and externally visible memory accesses occur in an order specified by the standards. For example, it must be possible to ensure that a prompt message is actually printed prior to querying input from a terminal, or to ensure that a read operation precedes a write operation on memory-mapped I/O.
Environment
The implementation is the (set of) software that facilitates translation of C source code into an executable format, and the subsequent execution of functions in the translated executable. The contexts in which these two tasks are accomplished are called the translation environment and the execution environment.
In many cases, a single system encompasses both of these environments, but in others they may be separated and further divided across two or more systems, such as when building software on a development machine (translation environment) that is meant to be deployed on a separate embedded system (execution environment) that lacks its own facilities to translate code into executables.
The distinction between the translation and execution environments is one of the reasons C is such a versatile language – many other high level languages like Python and Java require runtime support, such as an interpreter, which may be infeasible to deploy onto a resource constrained target system such as an embedded device.
Translation Environment
The translation environment is the portion of the implementation that supports conversion of C source code into an executable that can be run on the target system (the execution environment). Translation begins with a C source file together with the headers and other source files it pulls in through #include directives, which are collectively called a preprocessing translation unit. There are eight phases of translation, but they can be grouped into three main stages: preprocessing, compilation, and linking.
Preprocessing
The first six translation phases involve various text-processing steps that operate on the preprocessing translation unit in a purely syntactic manner, such as replacing comments with spaces, evaluating preprocessing directives, performing preprocessor macro expansions, and concatenating adjacent string literals. The output of this stage is called the translation unit, consisting of a sequence of preprocessing tokens, grouped into categories such as identifiers, numbers, string literals, punctuators, and so on.
Preprocessing is a form of meta-programming, which has four main uses:
Injecting the contents of other source files directly into the current translation unit, using the #include directive.
Injecting compile-time configuration parameters into source code, using preprocessor macros, such as to inject version information.
Conditionally compiling code in response to configuration parameters, such as to target a particular operating system.
Generating C source code at compile-time using macros.
Compilation
In the seventh translation phase, the preprocessing tokens are converted into tokens and analyzed syntactically and semantically as elements of the C programming language; compilation takes place, producing a compiled translation unit. The output of this stage is native machine code, stored in a form known as object code.
Object code may contain references to identifiers that have been declared as external to the current translation unit, and are instead defined in another translation unit; these references remain unresolved following compilation, and prevent execution of functions which depend on them.
Linking
In the final phase, compiled translation units and library components needed to satisfy external references are collected to form a program image that contains all the information needed for execution in its execution environment. In other words, this produces an executable, or loadable, program from the pieces contained in object files that compilation produced.
Execution Environment
The execution environment is the system on which the compiled program is actually run. This more or less refers to the operating system (or lack thereof), and any additional support utilities (such as a loader/dynamic linker) or other running programs that affect the execution of a program. The most fundamental property of the execution environment is whether it is hosted or freestanding.
Hosted execution environment
A hosted environment provides an operating system along with a full implementation of the C standard library. A hosted environment manages program startup by calling into the standardized entry point of the program, a function called main(), which is passed a list of strings, called its argument vector. The hosted environment also supports program termination and allows the program to return an exit status to the environment when it does so. This is generally facilitated by a small amount of system-specific code written in assembly, which handles the entry and exit of a C program, called the C-runtime (CRT).
Freestanding execution environment
On the other hand, a freestanding environment provides only a minimal subset of the standard library, and the entry point of a program is defined by the implementation. This allows C programs to be written for a wide variety of systems, including embedded systems, where an operating system is not present. In fact, operating systems themselves are written for, and execute within, a freestanding environment. This high degree of control and flexibility is what earned C the moniker “portable assembly”, and led to its widespread adoption as the de facto systems programming language.
Behavior
The external effects of a program are called its behavior. A program which strictly conforms to the C standards always exhibits well-defined, specific, behavior when executed on any conforming implementation. However, many useful programs conform to the standards while invoking additional non-standardized features such as system calls, compiler extensions, and internationalization; these programs may execute correctly while exhibiting behaviors that are unspecified, implementation-defined, and locale-specific, and may not be portable to other otherwise conforming implementations that lack support for those additional features.
Additionally, nonconforming programs exhibit undefined behavior, which is completely unpredictable, though it often produces silent bugs that are difficult to diagnose.
In many situations, the standard provides for more than one possible behavior, but places no further restrictions on which behavior is chosen in any instance. This is called unspecified behavior. An example of this is that the individual arguments to a function call are each evaluated, but the order in which they are evaluated is unspecified. If one argument has some side effect when evaluated, it is indeterminate whether that side effect will have happened before, or after, another argument is evaluated.
Implementation-defined Behavior
Implementation-defined behavior is like unspecified behavior, except that the implementation is required to make a consistent choice among the possible behaviors and to document that choice. For example, the result of converting an out-of-range value to a signed integer type, or of right-shifting a negative signed integer, is implementation-defined. Note that overflow in signed integer arithmetic is not in this category; it is undefined behavior.
Locale-specific behavior is that which depends on the specific localization of the execution environment, taking into account aspects of nationality, culture, and language. For example, many of the library functions used to process or categorize text exhibit locale-specific behaviors.
Undefined behavior (UB) is the boogeyman of C programming, and is an unusual aspect of the language that often frustrates and amazes learners. UB is a catch-all category for everything that doesn’t fall into one of the above categories, and invoking undefined behavior can have any result at all, famously referred to as nasal demons.
Undefined behavior exists for two very important reasons. First, because there are few restrictions and safety mechanisms in place, it’s much easier to implement the C language than it is to implement other, safer languages. This is the primary reason why C became as ubiquitous and enduring as it is: it was simple to port a C compiler to different systems, at a time when every system ran proprietary hardware and operating systems that were all mutually incompatible.
The second reason is that UB enables a great deal of optimization. The compiler is free to assume that UB never occurs, that is, that the programmer never makes such a mistake, and it can therefore optimize in extremely clever ways. Without UB, compilers would have a much more difficult time ensuring programs run quickly while also remaining correct.
| | Constrained | Consistent | Documented | Portable |
|---|---|---|---|---|
| Unspecified | Yes | No | No | No |
| Implementation-defined | Yes | Yes | Yes | No |
| Locale-specific | Yes | Locale-dependent | Yes | Yes |
| Undefined | No | No | No | No |