ICS 45C Spring 2022, Notes and Examples: Linked Data Structures

What are linked data structures?

In this course, we've seen that one way to organize a homogeneous collection of objects is to store them in a single-dimension array. Organizing elements this way certainly has its upsides:

Accessing an element given its index can be quite fast, since its address can be calculated with a simple formula
Particularly if the elements are somewhat small and they're being accessed sequentially, the fact that they're stored contiguously can be a major performance win, since a block of many elements may be brought into the cache simultaneously when one of them is accessed, dramatically reducing the cost of accessing others that are nearby.

However, using single-dimension arrays also requires you to give some things up. Like most decisions you make when designing and writing programs, there are tradeoffs involved, and it's wise to know what those tradeoffs are.

An array's size can't change once it's been allocated. That means you either have to know exactly how much storage you need, or that you are willing to give up some excess storage to give you "growth space" if you need it. It also means that you may need to do a lot of reallocation — building larger or smaller arrays as needed — if the amount of space you need varies widely and you want the amount of memory you use to be in line with the amount you actually need. (There are ways to mitigate these costs, but they can't, in general, be eliminated.)
Accessing an array element can be quite fast if you know the index of the element you want, but for scenarios where searching is required, arrays become more problematic, since you're left with two choices that each leave something to be desired:
- Storing the elements in no particular order, which means that the only kind of search you can do is a linear one (i.e., look at the first element, then look at the second, and so on, until you find what you're looking for), since you'll have no meaningful insight about where you should be looking. Such searches take O(n) time.
- Storing the elements in order (e.g., by some kind of unique identifier, like a student ID), allowing you to use a binary search instead, which reduces the time to search n elements to O(log n), at the cost of making insertions and removals take O(n) time, because of the need to shift elements to create or eliminate empty space.

There are substantial use cases where single-dimension arrays (or other collections layered on top of them, such as std::vector) are a great choice, but there are also plenty of scenarios where they aren't. For that reason, it's good for us to understand other techniques we might use. There are a lot of these techniques — and we teach an entire course (ICS 46) that deals with topics like this one — but many of them are centered around the same basic idea: You can obtain more flexibility — albeit with other potential costs, such as performance — if you don't require the elements of a collection to be stored contiguously in memory, as they are in a single-dimension array. Linked data structures instead store their elements separately in nodes, with each node having the responsibility of keeping track of the location of one or more other nodes. Depending on how you organize these nodes, a linked data structure will use different algorithms and have different performance characteristics, but, in general, if you know the location of one node, you can find all of the others that are linked to it.

Because a linked data structure is organized more dynamically, it can be reconfigured on the fly, with new nodes being added or existing nodes being removed at any time. This allows us to have only the number of nodes we actually need at any given time, though it can be more expensive to find individual nodes, because of the need to look at many nodes before you find the one you want, even if you know where it is in the data structure. ICS 46 will focus more attention on the specifics of the performance tradeoffs than we will here, but you will need to familiarize yourself with linked data structures in this course, since they are a key problem you face in Project #3.

Linked lists

The simplest kind of linked data structure is called a linked list. A linked list stores a one-dimensional sequence of elements; in that sense, they're like arrays. Unlike arrays, however, linked lists don't impose any restriction on where the elements are stored in memory. Instead, each is stored in a node, and each node has the additional responsibility of keeping track of where the next node in the sequence is. Given the location of the first node, you can then find any other node, simply by iterating through them (i.e., the first node can tell you where the second one is, the second can tell you where the third is, and so on); in this way, you can find the location of the i^th node in the sequence in O(i) time.

Conceptually, a linked list looks like this:

Linked list diagram

This linked list is built out of three nodes, each being a separate object that stores two things: its data (some value that we wanted to store in the list) and its next (the location of the next node). The last node's next value indicates that no node follows it.

There's one additional idea, which is really important in practice, that is embodied by this diagram. We can't find any of these nodes unless we know the location of the first, but we can find all of them once we know where the first one is. That's why the diagram includes one additional piece of information called head, which is not itself a node, but that allows us to find the first node. If there was an object that represented our entire list (e.g., we were building a class called LinkedList), it would probably contain (at least) the location of that first node, though the nodes themselves would be separate objects.

At any given time, we only need as many nodes as we have elements in the list; there is no need to reserve "extra space." We can add new nodes to the list by allocating them in memory and then "linking" them into the list; for example, if we allocate a new node and then store the location of the first node as the new node's "next," we've inserted a new node into the list before all of the others. Similarly, we can insert nodes in other positions in the list, or remove nodes from any position in the list, without affecting the location of any of the other nodes; changing "links" is sufficient to reconfigure the list's "shape" on the fly.

There's one more thing we need to understand. While the diagram above implies that the location of the nodes is formulaic (one after another), that's not really true; each of these nodes might be stored anywhere in memory, regardless of where the others are, and the job of the "arrows" in the diagram is to keep track of these locations, wherever they are.

Implementing a linked list in C++

Linked lists are the same, conceptually, in any programming language in which you want to implement them; as long as there's some way to allocate the nodes, and as long as there's some way for each node to keep track of the location of another, we can build a linked list. However, the fine-grained details of the implementation will differ from one language to another, so it's worth taking a brief look at how these details might be implemented in C++. Since I'm asking you to implement a data structure that includes linked list functionality in Project #3, I'll forego providing a complete implementation — I want you to have the chance to work through building it yourself — but we should at least agree on the key concepts.

Implementing nodes

Each node would best be stored in a single object, so that they can be allocated and deallocated one at a time. Since each node contains two values of different types, we need to use a technique that lets us store a heterogeneous collection of values; structs are a good choice here, since the nodes themselves don't need to be "smart" (i.e., they don't need member functions or their own "Big Three"), since we can instead empower the list to manage them in combination.

We're going to want the nodes to be allocated and deallocated on the fly during the course of the linked list's lifetime. In C++, that means we'll need to use dynamic allocation, which will necessitate the use of new (to create them) and delete (to destroy them). When we use new to allocate one, we'll get back a pointer to it; when we delete it later, we'll need a pointer to it. So pointers will be a convenient way for us to keep track of the location of a node. In fact, we won't be able to use references for this purpose, since references lack two abilities that are important here: the ability to be changed over the course of their lifetime (e.g., when we need to change some node's next) and the ability to represent the concept of "null" (which is what we'd store as the last node's next, to make clear that no node follows it).

Putting these ideas together, if we wanted each element of our linked list to be an integer, we might design the nodes this way:

struct Node
{
    int value;
    Node* next;
};

Implementing our linked lists

A linked list might be nicely implemented as a class, so that the nodes can be kept completely private — an implementation detail of our linked list, so that we don't have to entrust our entire program to do the right thing with every pointer — and so that we can provide member functions and a "Big Three" that safely utilize and manage the list's nodes.

Once we have a pointer to the first node, we'll be able to iterate through the nodes to find the others, so all we'll need to store in our linked list object is a member variable that is a pointer to the first node. (We might also store other things, such as the number of nodes, if we had the need.)

It will be vitally important that we implement the "Big Three" in our class. The pointer to the first node is not what we call a "well-behaved" member variable; it's a pointer to something that was dynamically allocated by our list, which will need to be destroyed when the list dies and copied when the list is copied.

As a very simple starting point, the design of a LinkedList class might start out looking like the one below; I'm assuming here that each element of the list is an int value.

class LinkedList
{
public:
    LinkedList();
    LinkedList(const LinkedList& list);
    LinkedList& operator=(const LinkedList& list);
    ~LinkedList();

    // ...

private:
    struct Node
    {
        int value;
        Node* next;
    };

    Node* head;     // pointer to the first node
};

Notice that the Node struct has been declared as a private member of the LinkedList class, and that the next member variable is also private; this means that code outside of the LinkedList class won't even be able to say Node. In this way, there's no way to use the nodes of our linked list except within our LinkedList class, which is a good design choice, since it prevents us from spreading an implementation detail (how nodes are structured and linked) throughout our program.

Of course, we'd need member functions to perform the various list operations that we needed, such as adding elements, removing elements, or searching for elements. What you'd write is a bit different depending on the situation, so I've left them out here.

How do we iterate through the nodes?

While I'm leaving the details of implementing individual linked list operations as an exercise for you, there is a common thread that underlies many of them: iterating through the nodes of a list. Understanding the pattern for doing that will help you to solve the problems you encounter along the way.

Suppose that we're implementing a member function in our LinkedList class that calculates the sum of the elements of the list. (In general, this probably isn't a member function we'd want in our LinkedList, since it's so specific; we'd be better off with a generalized mechanism for iteration, which could be extended to solve this problem and many others. But we'll use this as an example, since it requires iterating through the list's nodes.)

How we iterate through a linked list's nodes is to keep track of the location of a "current" node at any given time. We can use that current node's value in any way we need to, then use its next to make the next node be our "current" one. We're using pointers to keep track of locations, so this implies that our "current" node's location would be a variable of type Node*. We can use that pointer's "nullness" as a way to know that we've reached the end of the list.

int LinkedList::sum() const
{
    int total = 0;

    // Begin by pointing 'current' to the first node in the list, which is the
    // node that 'head' points to.  Note that if the list is empty, 'current'
    // will now be nullptr.  Note, too, that the "const" here promises that we
    // won't change the node that 'current' points to, not that we won't change
    // the 'current' pointer.  Remember that we can read these kinds of type
    // declarations from right-to-left: "current is a pointer to a Node that
    // is constant".
    const Node* current = head;

    // When 'current' is nullptr, that tells us to stop (i.e., we've "fallen
    // off" the end of the list.  So we'll loop while 'current' *isn't* nullptr.
    while (current != nullptr)
    {
        // Add the current node's value to our running total
        total += current->value;

        // Update our current node to be the one that follows the current node.
        // This moves us one node closer to the end of the list.
        current = current->next;
    }

    // When we've iterated through the entire list and added every value to our
    // running total, we can return the total; that's our sum.
    return total;
}

One important thing to understand about this example is that we haven't changed anything about any of the nodes. The current pointer has been made to temporarily point to each one, and we've updated that pointer along the way. When the member function ends, the current pointer (a local variable) dies, as it should. But none of the nodes has been changed.

By way of contrast, consider another somewhat nonsensical example of a member function we might add to our LinkedList class: one that doubles every node's value.

void LinkedList::doubleAll()
{
    Node* current = head;

    while (current != nullptr)
    {
        current->value = current->value * 2;
        current = current->next;
    }
}

How we know that we've changed the nodes themselves is that we've assigned to the value member variable of a Node struct. Similarly, if we wanted to change the way that the nodes are linked to each other, we could only change that by assigning to a Node's next member variable.

One more important piece of C++ syntax

There is one other minor syntactic issue that crops up in our design, which I'd like to warn you about if you want to go this route. Notice that we declared the Node struct as a member of the LinkedList class. This means that, from outside of the LinkedList class, its name is actually LinkedList::Node. This may seem irrelevant — if Node is private within LinkedList, when would we ever need to say its name from outside of LinkedList? — but it turns out to be important for one reason.

Suppose you wanted to write a member function in LinkedList that takes a pointer to a Node as a parameter and returns a pointer to a Node. (It's not important what the function does; I just want to show you how to deal with its signature.) You might declare that member function like this:

class LinkedList
{
    ...
private:
    ...
    Node* foo(Node* n);
};

When you implement that in a source file, you might be inclined to try to write it this way:

Node* LinkedList::foo(Node* n)
{
    ...
}

But, unfortunately, this will turn out not to work. Node has no meaning outside of the scope of LinkedList and, technically, while the C++ compiler is reading your code, it doesn't consider itself to be in the scope of LinkedList until after it passes the part of your code where you've said LinkedList::. This means that the first use of Node in the function's signature will return a perplexing error message saying that Node has not been declared, even though the second use of Node (in the function's parameter) will be fine, because it follows where you said LinkedList::.

The solution to this problem is to qualify Node's name where necessary:

LinkedList::Node* LinkedList::foo(Node* n)
{
    ...
}

Note that we only have to do this to the return value, but not to the parameter, because the parameter appears after the name of the function and, hence, is considered to be in the scope of LinkedList.