ICS 45C Spring 2022, Notes and Examples: Structures

Includes a code example with the moniker Structures

Background

Just as we have a need for homogeneous data structures — collections of some number of elements of the same type — there is also a common need for heterogeneous data structures, where we bring together multiple values of potentially different types into a single entity. Ideally, these values have names that differentiate them from one another in a readable, meaningful way, though there are sometimes other ways to access them (e.g., tuples in Python, whose values are accessed by an index).

Once we pull together related data of different types into a single object, we can now access all of it using a single variable, pass it to a function with a single argument, and so on. This can be a very big benefit, relative to the alternative of storing many variables and passing many parameters; a program is more readable if we can establish a single, simple name for a concept, even one that's more complex than built-in types like ints or doubles.

What are structures?

The simplest solution to a problem like this in C++ is to use something called a structure — generally called a struct, in line with the syntax used to declare them. (There are fuller solutions to this problem, which we'll see soon, but we're starting with the raw materials and working our way up, so structures are a good place for us to start.) A struct brings together a set of members (or, more specifically, member variables). Each member variable has both a name and a type, with the names specifying how each of them is to be accessed or modified, and the types specifying what can and can't be stored in (and done to) each of them.

As a quick example, consider the following struct that represents a calendar date. (Aside: I wouldn't generally implement dates this way, except perhaps in "last mile" code in a user interface, because dates are a fair bit more complicated than just being three integers.) This is a declaration of that struct; remember that declarations specify the existence of something (in this case, a type called Date, which is a struct containing three unsigned integers) without concretely giving it life (i.e., without creating a variable of the new type).

struct Date
{
    unsigned int year;
    unsigned int month;
    unsigned int day;
};

Note the semicolon at the end of the declaration. That's no joke; it has to be there! Without it, you may find yourself getting some fairly strange-looking error messages out of a C++ compiler, which may be focused on the things that directly follow the struct declaration rather than the source of the problem. (The reason why the semicolon has to be there is a bit of a long story, due to the history of the syntax of C, which allowed a struct declaration to be followed directly by the name of one or more variables that would take on that struct's type. The semicolon is necessary so the compiler knows that you don't intend to do that.)

Before we go too far with this, it's important to reiterate that this is one way in C++ to create a new data type. There is a new type in our program called Date, and C++ doesn't draw a meaningful distinction between built-in types like int and the types we create; all data types are on a fairly equal playing field. (This is different than you might expect, especially if you've programmed in Java before, where "objects" are distinguished clearly from values of "primitive types." C++ lacks this distinction altogether; even ints are "objects" in the C++ sense.)

Statically allocating a struct

Once you've declared a struct, you've created a new data type, which means that you should now be able to instantiate it — meaning that you should be able to create variables of the new type. As with variables of built-in types like int, you have your choice about where the variables should be allocated, the most common two choices being statically-allocated on the run-time stack or dynamically-allocated on the heap. Given that Date structs are relatively small, we should tend to prefer them to be statically allocated, though it's sometimes unavoidable that we'll need them to be dynamically allocated instead (e.g., when their lifetime doesn't match the lifetime of some function), so we should know how to do both.

Statically allocating a Date struct on the run-time stack is just like declaring an int variable.

void foo()
{
    Date d;
    ...
}

The compiler is able to determine how much memory is required for the struct by deciding on a layout of its members. The members are laid out in sequence, one after another (though sometimes with empty space called padding intentionally left between them, added to align the values in memory so they can be accessed more efficiently). On the ICS 45C VM, unsigned integers are four bytes, so we might expect a Date struct to be 12 bytes, four for each of the unsigned integer members. Because of this, the struct is no more expensive — in terms of time or memory — than having three separate unsigned int variables, which means we don't have to worry that we're costing ourselves performance to achieve an abstraction that otherwise simplifies our program. Being able to do this is all upside.

Having declared a Date variable as above, its members' values will be initialized in whatever way is normal for the members' types. A Date contains three unsigned integers; unsigned integers are not necessarily initialized automatically (e.g., they may take on whatever value was in memory previously) so the Date begins its life with undefined values stored in it. We can set the members to new values individually by name, using the "." operator.

void foo()
{
    Date today;
    today.year = 2016;
    today.month = 10;
    today.day = 2;
    ...
}

We can read the "." to mean "The thing on the left has members and I'd like to access the member named on the right." So, for example, today.year indicates that we believe today to be an object with members (it is!) and that we'd like to access its year member (which it has!). Since the compiler is aware of the layout of the struct — not only the names of the members, but their relative positions in memory inside of it — it will know precisely where to go. today.year is an lvalue (i.e., it has a storage location), so we can assign to it or read from it.

Alternatively, we can initialize a struct's members at the point where we create it using C++'s uniform initialization syntax — which, as we'll see, standardizes the way that objects of most types can be initialized.

void foo()
{
    Date today{2016, 10, 2};
    ...
}

The construct {2016, 10, 2} is called an initializer list, which specifies a sequence of values that are to be used to initialize today. Using an initializer list with a struct causes the struct's members to be initialized in the order they're declared in the struct declaration — so, year, then month, then day.

Note that there are a couple of consequences of using initializer lists in this way. One consequence is that a change in the order of the members in the struct declaration might radically change how your code behaves without you receiving a compile-time error. Another is that you've removed the members' names from your point in your code where initialization happens, which can sometimes make it harder for a program to read; you could argue that it's fairly clear what 2016, 10, and 2 are intended to be here, but even a simple case like this isn't entirely obvious, since dates are typically written with their components in different orders in different parts of the world. A middle ground called designated initializers will be introduced in C++20, which would allow you to name the members you're initializing.

void foo()
{
    // Not legal yet, but will be legal in C++20
    Date d{.year = 2016, .month = 10, .day = 2};
    ...
}

What happens when a statically-allocated struct is deallocated

When a statically-allocated struct falls out of scope, it is destroyed automatically, taking its members with it. Note, though, that if its members are pointers, the pointers will be destroyed, but the objects the pointers point to will not! In this sense, the members of a struct are destroyed the same way that separate variables of the same types would be destroyed.

Dynamically allocating a struct

Structs can also be allocated dynamically, as well, using the same new operator that we've seen previously. As before, the new operator returns a pointer to the newly-allocated struct.

Date* d = new Date;

In this case, 12 bytes will be allocated on the heap and a pointer to the 12-byte block will be returned.

Accessing the members of a struct through a pointer is a little bit trickier.

d.year = 2005;     // Illegal, because d is a pointer; it has no member called year
*d.year = 2005;    // Also illegal, because "." has a higher precedence than "*"
(*d).year = 2005;  // Legal, but ugly as sin!

Only the last of these says what we actually mean: "Go to where d (a Date pointer) points and then access its year member." But it's a really unfortunate piece of syntax, because it's ugly and error-prone. So C++ includes another operator, the "arrow" operator (->) that means the same thing: "The thing on the left is a pointer to an object with members. Go to where it points and then access the member named on the right."

d->year = 2005;    // Much nicer!

Similar to statically-allocated structs, you can use an initializer list to give values to the struct's members at the time of creation.

Date* d = new Date{2016, 10, 2};

What happens when a dynamically-allocated struct is deallocated

As with other dynamically-allocated objects, we'll need to deallocate the struct when we're done with it. We do this the same way that we deallocate other dynamically-allocated objects, by using the delete operator.

delete d;

The Date object will be deleted, taking its members with it, but note again that if any members are pointers, the pointers will be destroyed, but the objects the pointers point to will not! If you want to delete those, that'll be up to you, and you'll need to do it before you destroy the struct containing those pointers.

Passing structs as parameters

A struct can be passed to a function in the same ways that an int can: by value (meaning that a copy of the struct is passed), by reference (meaning that a reference to the struct is passed, so changes to the struct inside the function take effect in the caller), and so on. Pass by value offers the guarantee of value semantics (changes inside the function do not affect the caller), though this can be expensive if the struct is large, so in those cases, we might prefer passing the struct by reference (with const, if we want to protect the caller's struct from being changed).

Syntactically, passing a struct as a parameter is unsurprising.

void foo(Date d)
{
    // d is a copy of the struct passed as a parameter
}

void bar(Date& d)
{
    // modifications to d take effect in the caller
}

var baz(const Date& d)
{
    // d is not a copy, but it can't be modified
}

More about how structs are laid out in memory

Given a struct declaration, a C++ compiler needs to decide on a layout for its members, which is to say it'll have to decide two things:

How much memory will be used to store one instance of the struct?
Where, within that amount of memory, will each of the struct's members be stored?

The answer to these questions should be the same for every instance of the struct, because the compiler will have to emit code that allocates them, manipulates them, passes them as parameters, and so on. So, for example, let's consider this short function:

void foo()
{
    Date d{2005, 11, 1};
    std::cout << d.year << '-' << d.month << '-' << d.day << std::endl;
}

When the compiler is deciding on the layout for foo()'s activation record, it'll need to know how big a Date object is. When it needs to emit code that accesses the year, month, and day fields, it'll need to know where they are within the Date object that's been allocated on the run-time stack.

As with many such details, there are no hard-and-fast rules; they're left loosely defined, so compiler implementers can emit code that is the most efficient (in terms of memory usage or time, which are quite often traded off against one another). But there are some aspects of struct layout that are well-defined, even if the details are left open.

C++ guarantees that a struct's members will be laid out in memory one after another, in the order declared. That means the minimum size of a Date struct would be the sum of the sizes of its members. Since a Date struct has three unsigned int members and, on the ICS 45C VM, unsigned ints are four bytes each, a Date object will be at least 12 bytes. If we're curious, we could find out by running this code:

std::cout << sizeof(Date) << std::endl;

and, indeed, on the ICS 45C VM, this will print 12, though you might get different output on a different compiler or a different operating system.

It's not always the case that struct members are crammed into memory as tightly as possible. Consider instead this struct declaration:

struct X
{
    char a;
    int b;
    short c;
    double d;
};

On the ICS 45C VM, I declared this struct and then ran the following code to gather more information about the struct's layout in memory.

X example;
std::cout << sizeof(example.a) << std::endl;
std::cout << sizeof(example.b) << std::endl;
std::cout << sizeof(example.c) << std::endl;
std::cout << sizeof(example.d) << std::endl;
std::cout << sizeof(X) << std::endl;

The sizes of the individual members mirror the sizes of the corresponding built-in types: a is 1 byte, b is 4 bytes, c is 2 bytes, and d is 8 bytes. This might lead one to believe that the size of an X struct is 1 + 4 + 2 + 8 = 15 bytes. However, the last line of output tells a different story; the size of an X struct is listed as 24! So what's going on here?

On many processor architectures, memory accesses are significantly faster if they are done on appropriate boundaries. While the rules below are hypothetical, they're not entirely uncommon in reality.

Accessing any single byte is generally about as fast as accessing any other one.
Accessing two bytes is best done on a two-byte boundary (i.e., a memory address that is a multiple of 2).
Accessing four bytes is best done on a four-byte boundary (i.e., a memory address that is a multiple of 4).
Accessing eight bytes is best done on a eight-byte boundary (i.e., a memory address that is a multiple of 8).

Consider, then, where the members of an X would be stored if one was placed immediately after another:

a would be at the beginning of the struct (i.e., at an offset of 0 bytes from the beginning)
b would be just after a, so at an offset of 1 byte from the beginning, since a is 1 byte
c would be just after b, so at an offset of 5 bytes from the beginning, since b is 4 bytes
d would be just after c, so at an offset of 7 bytes from the beginning, since c is 2 bytes

The problem is that accessing the four-byte value of b, the two-byte value of c, and the eight-byte value of d would all be significantly slower than they need to be, because none of them are aligned on a proper boundary. So C++ compilers quite often introduce padding into a structure, bytes that are intentionally left unused, but that serve to improve access speeds to the individual members by placing them at the appropriate boundaries. Since not all architectures have the same rules for this, there is no standard way that C++ pads structures — and many compilers let you turn this feature off, selectively for individual structs, if you're more interested in saving memory or aligning things in particular places (e.g., if you're communicating with hardware that requires a very particular layout) than you are in access speed.

So the reason that the size of an X struct is 24 bytes is because of padding introduced into it. If we're really curious about where the padding is, we could even print out the addresses of individual members and see where they are in relation to each other, then use that to deduce where the padding is. For example:

std::cout << &example << std::endl;
std::cout << &example.b << std::endl;
std::cout << &example.c << std::endl;
std::cout << &example.d << std::endl;

When I tried this on the ICS 45C VM, I was able to see that the members were aligned as follows:

a is at offset 0 and has size 1
b is at offset 4 and has size 4
c is at offset 8 and has size 2
d is at offset 16 and has size 8

So, out of the 24 bytes used to store an X, 9 of those bytes (3 after a and 6 after c) are padding.

A disclaimer about structs

The actual rules in C++ differ a bit from what I've said here. Structs are more capable than this — and, as we'll see, they are technically indistinguishable from another feature called classes, except in only the most minor ways — but, as a practical matter, structs are most often used in the way we've seen here; we'll use them to provide the ability to bring a set of public member variables together and nothing else.

We'll only be using structs in this way this quarter; whenever we want something more full-featured, we'll use classes instead. (Note, too, that this is all structs can do historically; in C, there are no classes, while structs are limited, more or less, to what you see here.)

The code

The official moniker for this code example is Structures, so your best bet is to do this:

Create a new project using the example project template.
Change into the project directory.
Issue the command ./download Structures to download the code example into your project directory's app directory.

Alternatively, you can click the link to the tarball below:

Structures.tar.gz