Direct I/O

Direct I/O is used to read or write arrays of objects in a single operation, and may be used with raw text or raw data.

The functions size_t fread(void *buf, size_t size, size_t nmemb, FILE *stream); and size_t fwrite(void const *buf, size_t size, size_t nmemb, FILE *stream); read from or write to stream, respectively, up to nmemb items of size bytes each. The return value is the number of items successfully read or written; it may be less than nmemb (a short count) if an error occurs or, for fread, if end of input is reached.
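
As a minimal sketch of how a short count might be handled, the following wraps fread and uses feof and ferror to tell the two causes apart. The helper names read_items and demo_short_count are hypothetical and exist only for illustration; the demo relies on tmpfile() being able to create a temporary file.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to nmemb items of `size` bytes each and
 * reports on stderr why the count came up short, if it did. */
size_t
read_items(void *buf, size_t size, size_t nmemb, FILE *stream)
{
  size_t n = fread(buf, size, nmemb, stream);
  if (n < nmemb) {
    if (ferror(stream))
      fprintf(stderr, "read error after %zu items\n", n);
    else if (feof(stream))
      fprintf(stderr, "end of input after %zu items\n", n);
  }
  return n;
}

/* Hypothetical demo: a 10-byte file read as three 4-byte items.
 * Only two complete items fit, so the count is short. */
size_t
demo_short_count(void)
{
  unsigned char out[10] = {0};
  unsigned char in[12];
  size_t n = (size_t)-1;
  FILE *f = tmpfile();
  if (!f)
    return n;
  if (fwrite(out, 1, sizeof out, f) == sizeof out) {
    rewind(f);
    n = read_items(in, 4, 3, f);
  }
  fclose(f);
  return n;
}
```

Note that the return value counts complete items only: the two trailing bytes of the 10-byte file do not form a whole 4-byte item, so they are not reflected in the count.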

These functions behave as if by calling fgetc or fputc in a loop, size times for each of nmemb objects, but are typically much faster. A hypothetical implementation is shown below:

size_t
fread(void *buf, size_t size, size_t nmemb, FILE *stream)
{
  unsigned char *_buf = buf;
  size_t ret = 0;
  for (; ret < nmemb; ++ret) {
    for (size_t i = 0; i < size; ++i) {
      int c = fgetc(stream);
      if (c < 0) goto end;
      *(_buf++) = c;
    }
  }
end:
  return ret;
}

These functions were originally intended to operate on record-like data, stored in a file as fixed-size records and read in record-sized chunks directly into arrays of data structures; hence the separate size and nmemb arguments. Each function returns the number of records that were read or written. For example:

struct my_data {
   int x;
   int y;
};

/* ... */
struct my_data records[10]; /* the original declaration lacked a variable name */
fwrite(records, sizeof (struct my_data), 10, outfile);

In modern programming practice, however, data is stored, retrieved, and transmitted as sequences of bytes[1], and programs are responsible for converting data to and from a generic byte-sequence representation as needed (marshalling). For this reason, the size argument should always be 1 (recall that sizeof (char) == 1 by definition) and nmemb should be the number of bytes to be read or written.
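
To illustrate the byte-oriented calling convention, here is a minimal sketch that copies one stream to another in chunks. The function name copy_bytes and the 4096-byte buffer size are arbitrary choices for this example, not part of any standard interface.

```c
#include <stdio.h>

/* Hypothetical helper: copies all remaining bytes from `in` to `out`.
 * Returns 0 on success, -1 on a read or write error. */
int
copy_bytes(FILE *in, FILE *out)
{
  unsigned char buf[4096];
  size_t n;

  /* size is 1, so the value returned by fread is a byte count */
  while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
    if (fwrite(buf, 1, n, out) != n)
      return -1; /* short write: treat as an error */
  }
  /* fread returned 0: either end of input or a read error */
  return ferror(in) ? -1 : 0;
}
```

Because size is 1, a short count from fread is simply the number of bytes that are valid in the buffer, and that number can be passed straight to fwrite.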

Storing and retrieving in-memory data directly can go wrong because many aspects of how data types are laid out and represented in memory are up to the implementation: for example, the endianness and size of integer types, or the padding between members of a struct. Reading and writing data as anything other than marshalled bytes can therefore produce errors, even between two builds of the same program compiled with slightly different compiler configurations. The following is an example of such an unsafe data handling practice:

int my_ints[10] = { /* ... */ };
size_t num_written = fwrite(my_ints, sizeof (int), 10, my_ints_file);
/* ... */
size_t num_read = fread(my_ints, sizeof (int), 10, my_ints_file);

Instead, the data should be converted to a sequence of bytes before writing, and converted back from that sequence of bytes after reading. A simple and widely used approach is to convert data to and from a plaintext representation, which has the added benefit that the stored data is easy for a human to read. This approach is often preferred for data types with implementation-defined sizes, such as int. Fixed-width data types, such as uint32_t, on the other hand, are more amenable to direct conversion to a byte sequence for storage and retrieval; e.g. we could encode and decode a uint32_t value as shown below:

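#include <limits.h> /* CHAR_BIT */
#include <stddef.h> /* size_t */
#include <stdint.h> /* uint32_t */

/* Returns a uint32_t whose in-memory bytes hold `in` in little-endian order. */
uint32_t
marshall_uint32_t(uint32_t in)
{
  uint32_t out = 0;
  unsigned char *c = (void *)&out;
  for (size_t i = 0; i < sizeof out; ++i) {
    c[i] = in;       /* store the current lowest-order byte */
    in >>= CHAR_BIT; /* move on to the next byte */
  }
  return out;
}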

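/* Interprets the in-memory bytes of `in` as little-endian and returns the value. */
uint32_t
unmarshall_uint32_t(uint32_t in)
{
  uint32_t out = 0;
  unsigned char *c = (void *)&in;
  for (size_t i = 0; i < sizeof in; ++i) {
    out <<= CHAR_BIT;
    out |= c[sizeof in - i - 1]; /* highest-addressed byte first */
  }
  return out;
}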

These two functions convert between the value of a uint32_t object and the sequence of bytes corresponding to its little-endian representation. On most modern systems, which store data in little-endian order, these functions leave the value unchanged. On systems that use big-endian or some other byte order, however, they convert the data properly for storage and retrieval.
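
To show how marshalled bytes might be combined with fread and fwrite, here is a minimal sketch of a round trip through a file. It takes a slightly different approach from the functions above: it marshals into an explicit unsigned char array rather than into another uint32_t, and it assumes CHAR_BIT == 8. The names put_u32_le, get_u32_le, write_u32, and read_u32 are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: stores v as 4 little-endian octets.
 * Assumes CHAR_BIT == 8. */
void
put_u32_le(unsigned char out[4], uint32_t v)
{
  for (size_t i = 0; i < 4; ++i)
    out[i] = (unsigned char)((v >> (8 * i)) & 0xFFu);
}

/* Hypothetical helper: rebuilds a value from 4 little-endian octets. */
uint32_t
get_u32_le(unsigned char const in[4])
{
  uint32_t v = 0;
  for (size_t i = 0; i < 4; ++i)
    v |= (uint32_t)in[i] << (8 * i);
  return v;
}

/* Writes one marshalled value; returns 0 on success, -1 on a short write. */
int
write_u32(FILE *f, uint32_t v)
{
  unsigned char buf[4];
  put_u32_le(buf, v);
  return fwrite(buf, 1, sizeof buf, f) == sizeof buf ? 0 : -1;
}

/* Reads one marshalled value; returns 0 on success, -1 on a short read. */
int
read_u32(FILE *f, uint32_t *v)
{
  unsigned char buf[4];
  if (fread(buf, 1, sizeof buf, f) != sizeof buf)
    return -1;
  *v = get_u32_le(buf);
  return 0;
}
```

Because the shifts operate on values rather than on memory layout, the bytes written to the file are the same regardless of the host's byte order.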