Preprocessor Language

The C preprocessor is a distinct component of a C implementation that carries out the first six phases of translation. It operates textually on a translation unit, splitting it into tokens such as identifiers, punctuation, numbers, character constants, and string literals. These tokens are processed, and the resulting token stream is the input to the compiler itself.

Line Continuation

During translation, the preprocessor first removes any backslash-escaped line endings from the translation unit: any line that ends with a backslash (\) is joined with the line immediately following it. This exists chiefly so that long C++-style comments and preprocessor directives, both of which otherwise terminate at the end of a line, can be broken across several lines.

For example, the X-macro pattern involves listing a large number of items, which is convenient to break over multiple lines as shown below,

#define BIOME_LIST \
   X(Desert) \
   X(Tundra) \
   X(Rainforest) \
   X(Grassland) \
   X(Savanna) \
   X(Taiga) \
   X(CoralReef) \
   X(Mountain) \
   X(Forest) \
   X(Swamp)

When the preprocessor removes backslash-escaped line endings, the above snippet is equivalent to,

#define BIOME_LIST X(Desert) X(Tundra) X(Rainforest) X(Grassland) X(Savanna) X(Taiga) X(CoralReef) X(Mountain) X(Forest) X(Swamp)
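
As a sketch of how such a list is typically consumed, the block below redefines X twice to generate an enumeration and a matching table of names from the same BIOME_LIST. The names enum biome, BIOME_COUNT, biome_names, and biome_name are hypothetical and chosen only for illustration.

```c
#define BIOME_LIST \
   X(Desert) \
   X(Tundra) \
   X(Rainforest) \
   X(Grassland) \
   X(Savanna) \
   X(Taiga) \
   X(CoralReef) \
   X(Mountain) \
   X(Forest) \
   X(Swamp)

/* First expansion: each X(name) becomes an enumerator followed by a comma. */
#define X(name) name,
enum biome { BIOME_LIST BIOME_COUNT };
#undef X

/* Second expansion: each X(name) becomes the string literal "name". */
#define X(name) #name,
static const char *const biome_names[] = { BIOME_LIST };
#undef X

/* Hypothetical helper: map an enumerator back to its spelling. */
const char *biome_name(enum biome b) {
   return ((int)b >= 0 && (int)b < BIOME_COUNT) ? biome_names[b] : "Unknown";
}
```

Because both the enumeration and the table are generated from one list, adding a biome requires editing only BIOME_LIST.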

Comment Elision

Next, the preprocessor removes any comments and replaces each with a single space. There are two forms of comments, called C-style (/* ... */), and C++-style (//). C-style comments begin with /* and end with the first matching */; these comments may span multiple lines, and do not nest. C++-style comments begin with // and extend to the end of the current line; note that comments are parsed after backslash-escaped line endings are elided.
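
The following sketch illustrates both points; all identifiers in it are hypothetical. The C-style comment is replaced by a space, so it separates tokens rather than gluing them together. Because line splicing happens first, a C++-style comment that ends in a backslash swallows the following line (many compilers warn about this).

```c
/* The comment below becomes a single space, so this declares an int
   named comment_gap; it does not form the identifier "intcomment_gap". */
int/* replaced by one space */comment_gap = 1;

// The backslash at the end of this line splices the next line into the comment \
#define SWALLOWED 1

/* SWALLOWED was never defined, because its #define was part of the comment. */
#ifdef SWALLOWED
int swallowed_defined = 1;
#else
int swallowed_defined = 0;
#endif
```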

Finally, the entire translation unit is lexically parsed into tokens: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and whitespace.

Preprocessor Directive Execution and Macro Replacement

Following these initial preprocessing steps, the preprocessor executes all preprocessing directives and expands any recognized macros. Preprocessing directives are used to conditionally process or skip sections of source files, include other source files, and manage macro definitions. A directive is a line beginning with a # token, followed by any additional tokens on that line. The fifteen directive forms are listed below,

directive:
   # if constant-expression
   # ifdef identifier
   # ifndef identifier
   # elif constant-expression
   # else
   # endif
   # include < header-name >
   # include " header-name "
   # define identifier replacement-list
   # define identifier( [ param-list ] ) replacement-list
   # undef identifier
   # line line-number [ file-name ]
   # error [ error-message ]
   # pragma [ token… ]
   #

By convention, since preprocessor directives are not part of the C language proper, the # of a directive is placed in the first column of its line rather than being indented to match surrounding code.

Conditional Inclusion

The preprocessor supports several directives to conditionally compile portions of the translation unit. These are the #if, #ifdef, #ifndef, #elif, #else, and #endif directives.

The #if and #elif directives check whether their controlling constant expressions evaluate to nonzero. These directives support one special unary operator, defined, which may be used in two forms,

   defined identifier
   defined ( identifier )

which evaluates to 1 if identifier is currently defined as a macro name, and 0 otherwise. The directives #ifdef and #ifndef are simply shorthand for #if defined and #if !defined.

Aside from operands to the defined operator, any macros are replaced before evaluation of the controlling expressions. If an identifier remains after macro replacement, it is replaced with 0.
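
As a small illustration of these rules, the sketch below uses hypothetical names (FEATURE_LEVEL, NOT_DEFINED_ANYWHERE, and the two enumeration constants) to show that defined tests whether a name is a macro, while any other leftover identifier is evaluated as 0.

```c
#define FEATURE_LEVEL 2

/* defined tests whether the name is a macro; its operand is not expanded.
   The second FEATURE_LEVEL is expanded to 2 before the comparison. */
#if defined(FEATURE_LEVEL) && FEATURE_LEVEL >= 2
enum { has_feature = 1 };
#else
enum { has_feature = 0 };
#endif

/* NOT_DEFINED_ANYWHERE is not a macro, so it is replaced with 0 and the
   comparison below is true. */
#if NOT_DEFINED_ANYWHERE == 0
enum { undefined_is_zero = 1 };
#else
enum { undefined_is_zero = 0 };
#endif
```

The second case is a common source of surprises: a misspelled macro name in an #if silently evaluates as 0 rather than producing an error.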

A set of related conditional directives is evaluated in sequence. If a condition evaluates to false, then the group it controls is skipped. The first group whose controlling condition evaluates to true is processed; if none evaluates to true, the #else group is processed, if present. For example,

#include <stdio.h>

void print_debug_level(void) {
#ifndef DEBUG_LEVEL
   puts("Debug level not defined");
#elif DEBUG_LEVEL == 0
   puts("Debug level 0: No debugging");
#elif DEBUG_LEVEL == 1
   puts("Debug level 1: Basic debugging");
#elif DEBUG_LEVEL == 2
   puts("Debug level 2: Advanced debugging");
#else
   puts("Unrecognized debug level");
#endif
}

When directives are embedded within a group, a common convention to aid readability is to add a space between the # and the directive name for each level of nesting. Here is an example of this from limits.h,

#if defined __USE_ISOC99 && defined __GNUC__
# ifndef LLONG_MIN
#  define LLONG_MIN       (-LLONG_MAX-1)
# endif
# ifndef LLONG_MAX
#  define LLONG_MAX       __LONG_LONG_MAX__
# endif
# ifndef ULLONG_MAX
#  define ULLONG_MAX      (LLONG_MAX * 2ULL + 1)
# endif
#endif

Source File Inclusion

Other header or source files may be included directly into a translation unit using the #include directive. These directives are processed recursively in included files. The name of the included file may be enclosed in either double quotes or angle brackets,

#include "stdio.h"
#include <stdio.h>

Each form searches for the specified file in an implementation-defined manner; if the double-quoted version’s search fails, it falls back to the angle-bracketed version’s search method.

In a typical implementation, the angle-bracketed form searches a set of system directories (commonly including /usr/include on UNIX systems), while the double-quoted form first searches relative to the directory of the current source file and then falls back to the system directories. The double-quoted form is therefore usually used to include headers internal to a particular project. Compilers provide options for adding directories to the search path of either form.

Macro Replacement

Macros perform simple token replacement on the source file. Macros are defined with the #define directive, and undefined with the #undef directive. A macro that is already defined may be redefined only with an identical definition; to give it a different definition, it must first be undefined. There are two types of macros, called object-like and function-like macros,

#define YEAR 2023                          /* Object-like macro */
#define MAX(a, b) ((a) > (b) ? (a) : (b))  /* Function-like macro */

When a macro is encountered, it is replaced with its definition. In the above example, an identifier token YEAR would be replaced with the preprocessing-number token 2023. Notice that this takes place after lexical parsing of the source file, so something like THE_YEAR, which is a single identifier token, would not be replaced with THE_2023.

Function-like macros work similarly, except that they also perform argument substitution. The number of arguments must match the definition previously provided, and each argument undergoes macro replacement before it is substituted into the replacement list. The function-like macro invocation is replaced with its definition, where every instance of a named parameter is replaced with the corresponding argument. In the above example, MAX(1, 2) would expand to ((1) > (2) ? (1) : (2)); the parentheses around each parameter and around the whole replacement list keep the expansion correct when the arguments are themselves expressions.
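
Because replacement is purely textual, operator precedence in the expanded text can differ from what the call syntax suggests. The sketch below uses two hypothetical macros, SQUARE_UNSAFE and SQUARE, to compare an unparenthesized replacement list with a parenthesized one.

```c
#define SQUARE_UNSAFE(x) x * x
#define SQUARE(x) ((x) * (x))

/* Expands to 1 + 2 * 1 + 2, which is 5 because * binds tighter than +. */
int unsafe_result(void) { return SQUARE_UNSAFE(1 + 2); }

/* Expands to ((1 + 2) * (1 + 2)), which is 9. */
int safe_result(void) { return SQUARE(1 + 2); }
```

Parenthesizing does not stop an argument from appearing, and being evaluated, more than once in the expansion, so arguments with side effects such as i++ are best avoided.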

Finally, after replacement, the resulting tokens are rescanned and undergo further macro replacement. However, a macro that is currently being expanded is not expanded again if its name reappears during rescanning, so infinite loops are not possible. For example,

#define X Y
#define Y Z
#define Z X

X Y Z

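would expand to X Y Z. Each name is replaced around the full cycle (X becomes Y, then Z, then X again), and expansion stops when the name of the macro being expanded reappears, so each name ends up where it started.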
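
One way to check an expansion like this is to stringify the fully expanded tokens. The sketch below assumes two hypothetical helper macros, SHOW and SHOW_, where the extra level of indirection makes the argument expand before it is turned into a string.

```c
#define X Y
#define Y Z
#define Z X

#define SHOW_(tokens) #tokens
#define SHOW(tokens) SHOW_(tokens)  /* the argument is expanded at this level */

/* Each name walks the cycle until it would expand itself again, at which
   point expansion stops, so every name ends up back where it started. */
const char *cycle_expansion(void) { return SHOW(X Y Z); }  /* "X Y Z" */
```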

Function-like macros also support two special operators called the stringification (#) and the token-pasting (##) operators. In the replacement list of a function-like macro, if a # precedes one of the parameter names, then the corresponding argument is converted to a string literal after replacement; any double quotes and backslashes inside embedded string literals or character constants are escaped as necessary,

#define STRINGIFY(x) #x

STRINGIFY(1) /* "1" */
STRINGIFY(2) /* "2" */
STRINGIFY("Hello world!") /* "\"Hello world!\"" */
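
Note that the operand of # is not macro-expanded before it is stringified. When the expanded value is wanted instead, a common idiom adds a second macro layer; the names STRINGIFY_EXPANDED, year_name, and year_value below are used only for illustration.

```c
#define YEAR 2023

#define STRINGIFY(x) #x
#define STRINGIFY_EXPANDED(x) STRINGIFY(x)

/* The argument is the operand of #, so it is stringified as written. */
const char *year_name(void) { return STRINGIFY(YEAR); }            /* "YEAR" */

/* Here YEAR is expanded to 2023 before being passed on to STRINGIFY. */
const char *year_value(void) { return STRINGIFY_EXPANDED(YEAR); }  /* "2023" */
```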

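Also within the replacement list of a function-like macro, if a ## appears between two parameter names, then the corresponding arguments are pasted together into a single token. The result is then rescanned, so a pasted identifier that names a macro is itself expanded,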

#define PASTE(a, b) a ## b
PASTE(123, 456) /* 123456 */

#define MY_MSG "Hello!"
PASTE(MY, _MSG) /* "Hello!" */
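
Token pasting is most often used to build identifiers. In the sketch below, a hypothetical DEFINE_COUNTER macro generates a variable and an accessor function whose names are derived from its argument.

```c
/* For DEFINE_COUNTER(hits), this produces a variable hits_count and a
   function next_hits() that increments it and returns the new value. */
#define DEFINE_COUNTER(name) \
   static int name ## _count = 0; \
   int next_ ## name(void) { return ++name ## _count; }

DEFINE_COUNTER(hits)
DEFINE_COUNTER(misses)
```

Each invocation yields an independent counter, so calling next_hits() does not affect the value returned by next_misses().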

Other Directives

The #line directive is used to set the internally tracked source line number, and, optionally, file name. This directive is frequently used when C source code is generated from templates or other meta-programming tools, so that diagnostic messages refer to the appropriate locations in template or code generation program files rather than particular lines in generated code.
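
As a minimal sketch, the fragment below pretends to be generated from line 100 of a hypothetical template file named biomes.tmpl. The standard predefined macros __LINE__ and __FILE__ reflect the values set by #line; the function names are illustrative.

```c
/* The line that follows the directive is treated as line 100 of "biomes.tmpl". */
#line 100 "biomes.tmpl"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }
```

Any diagnostic the compiler issues for these lines would likewise name biomes.tmpl rather than the generated file.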

The #error directive causes preprocessing to fail and emits an optional error message. This directive is frequently used to prevent compilation when certain features are missing or incompatible with the given program.
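
For instance, a source file that relies on 8-bit bytes might guard that assumption as sketched below. The message text and the constant bits_per_byte are illustrative; CHAR_BIT comes from the standard header limits.h.

```c
#include <limits.h>

/* Stop translation early, with a readable message, on unusual platforms. */
#if CHAR_BIT != 8
#error "This code assumes 8-bit bytes"
#endif

/* Reached only when the check above passed. */
enum { bits_per_byte = CHAR_BIT };
```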

The #pragma directive provides support for implementation-defined behavior; typically, the first token in the #pragma directive is the name of a custom directive supported by a particular compiler. One widely implemented, though nonstandard, pragma is #pragma once, which is used to prevent a file from being included more than once in a single translation unit, and often appears at the top of header files.

The # directive, with no directive name, is the null directive, and does nothing. Additionally, most implementations of the preprocessor accept directives that start with a number, such as # 123; these are called line markers, and are used internally by the compiler to track original source locations for diagnostics and debugging information.

Final Preprocessing

After all preprocessor directives are processed, and all macro replacement is complete, the preprocessor concatenates adjacent string literals; in other words, "hello" " " "world" is concatenated to "hello world". This is especially useful when using preprocessor macros to build strings from separate parts,

#define AUTHOR "Benny Beaver"
#define DATE "2023"

/* "Copyright (c) 2023, Benny Beaver. All rights reserved." */
char const *copyright = "Copyright (c) " DATE ", " AUTHOR ". All rights reserved.";