ICS 46 Spring 2022
Notes and Examples: Randomness


What is randomness and why do we want it?

The word random and its cousins, like randomness, are thrown around quite a lot in day-to-day speech, but, from a computing perspective, randomness actually has a somewhat more precise definition than we may normally be accustomed to using. We say that a sequence of values is random if we cannot deduce any sort of pattern that describes it. Statistics has much to say about what constitutes randomness — for example, how varied do the values of a sequence have to be? — and the deep specifics are beyond the scope of our interest at the moment, but what's important is that we can obtain a sequence of values that is varied and unpredictable when we need it.

You might wonder why you would ever want something like this in the first place. What good is a sequence whose values you can't predict? It turns out that such a sequence can solve a diverse set of interesting and important problems. A few examples follow.

  - Games often need unpredictability: shuffling a deck of cards, rolling dice, or having opponents behave in ways a player can't anticipate.
  - Simulations of real-world phenomena often model events that each occur with some probability, which requires a supply of random values to drive them.
  - Randomized algorithms make random choices internally to achieve good expected performance regardless of their input.
  - Cryptography depends on keys and other secrets that an attacker has no feasible way to guess.

The need is clear. The question is what kind of random sequences we might want when we're solving real problems and how we obtain them in a C++ program. A fairly recent version of the C++ standard, C++11, introduced a new library for generating random sequences of values. The new library, being superior in (more or less) every way to its ancient predecessor from the Standard C Library, is worth our time to learn a little bit about. But, first, we need to understand more about how computers generate randomness. Like a lot of things, it helps to understand the problem before you investigate the solution.


Entropy

First things first: If we want to generate a sequence of random values, where do we get it from? It's actually not as simple as it sounds. If you think about the programs you've written, you may have noticed that they're mainly deterministic, which is to say that they always do the same thing given the same inputs in the same situation. Algorithms tend to be this way; we tell a computer exactly what we want done, and exactly how we want it done, and the computer does it that way. But when we want varied behavior, that's a tougher nut to crack, unless we have the right tools.

What we need is a source of entropy: a sequence of bits that is unpredictable — that is, if we picked any bit in that sequence, it would have an equal probability of being a 0 or a 1, and even if we saw a huge number of those bits, it would give us no way to predict what the next bit would be. This is actually easier said than done. Where do we get this magical source of entropy from? The answer, in practice, is varied. Some operating systems gather entropy by observing aspects of their internal operation — mouse movements, network traffic, hard drive movement, and so on — that would be difficult to predict. Some computer hardware gathers it by observing other ambient factors, like tracking small fluctuations in temperature or other physical effects over time.
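On most platforms, C++ exposes exactly this kind of operating-system entropy source through std::random_device, which we'll return to later in these notes. A minimal sketch of tapping it directly:

#include <iostream>
#include <random>

int main()
{
    // On most implementations, std::random_device draws from an operating
    // system entropy source like the ones described above.
    std::random_device device;

    // Each call produces an unsigned integer's worth of hard-to-predict
    // bits, consuming some of the available entropy in the process.
    for (int i = 0; i < 5; ++i)
    {
        std::cout << device() << std::endl;
    }

    return 0;
}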

The problem, though, is that our sources of entropy are limited. In a given space of time, there are only so many operations that our operating systems can observe, and there are only so many meaningful measurements of physical effects that can be made. So, unfortunately, our primary sources of entropy might not be enough to supply us with the randomness we need. If we're running a high-powered simulation that requires many megabits of randomness every second, we may run out of entropy, which means we either have to run our simulation more slowly, or we have to be more clever about how we obtain the randomness we're looking for.


Pseudorandomness

There's an important detail worth considering, which sounds philosophical but is actually vital. Do we need a sequence that is actually random, or will the appearance of randomness be enough? If I chose a ridiculously long sequence of n bits and you had no reasonable way to use the first n − 1 bits to guess what the nth bit would be, would you care whether or not I used a deterministic algorithm to produce it? In practice, the answer is generally "no." What we want is the ability to generate values so that they appear to be random; we want them to be varied, and we want it to be essentially impossible to guess what the next one will be, if all you've seen are the values that have been generated so far. Whether they're coming directly from a source of entropy or from a straightforward algorithm is irrelevant for most uses.

You might wonder, though, how a deterministic algorithm — one that always yields the same outputs given the same inputs — could ever generate anything but a predictable sequence of results. The answer lies in how we choose the algorithm, and also in how we start the sequence.

Suppose the first input to our algorithm comes from a legitimate source of entropy, such as the ones described in the previous section; we'll call this value the seed. Now suppose that our algorithm is designed in such a way that it will generate a sequence that has the following properties:

  - Its values are varied and pass statistical tests of randomness, even though they're generated deterministically.
  - The sequence continues for a very long time before it begins to repeat.
  - Without knowing the seed, seeing the values generated so far gives no practical way to predict the next one.

Such algorithms are known as pseudorandom generators, because while their output isn't truly random, it nonetheless passes statistical tests of randomness. As long as you seed them with a legitimate source of entropy, they can generate fairly long sequences of random values without the sequence repeating, and since you couldn't easily guess the seed (assuming your entropy source was legitimate), you wouldn't be able to guess any of the generated values, either. How to design such an algorithm is well beyond the scope of this course, but, fortunately, we can stand on the shoulders of giants and use well-known algorithms that have already been designed for this purpose, such as the Mersenne Twister or a linear congruential generator.
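To make this concrete, here is a minimal sketch of a linear congruential generator, about the simplest pseudorandom generator there is. Each value is computed from the previous one by the recurrence next = a * previous + c (mod 2^32); the constants below are a commonly published choice (they appear in Numerical Recipes), though production-quality generators are designed with far more care.

#include <cstdint>

// A minimal linear congruential generator.  Each value is computed from
// the previous one by the recurrence  next = a * previous + c  (mod 2^32).
// The constants below are a commonly published choice; the seed should
// come from a genuine source of entropy.
class LinearCongruentialGenerator
{
public:
    explicit LinearCongruentialGenerator(std::uint32_t seed)
        : current{seed}
    {
    }

    std::uint32_t next()
    {
        // Unsigned 32-bit arithmetic wraps around on overflow, which
        // gives us the "mod 2^32" for free.
        current = 1664525u * current + 1013904223u;
        return current;
    }

private:
    std::uint32_t current;
};

Given the same seed, an object of this class will always produce the same sequence, which is exactly why the seed needs to come from somewhere an observer can't predict.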


Distributions

So far, we've developed a nice set of ideas for generating a long pseudorandom sequence of bits:

  1. Using a (quite possibly limited) source of entropy, seed a pseudorandom generator.
  2. Use the pseudorandom generator to generate the sequence of bits we need.
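In C++ (previewing the library we'll discuss shortly), those two steps take only a couple of lines. The sketch below uses std::mt19937, the standard library's Mersenne Twister engine, and prints a few raw 32-bit chunks of its output:

#include <iostream>
#include <random>

int main()
{
    // Step 1: seed a pseudorandom generator (an "engine") using our
    // limited source of entropy.  We consume entropy only this once.
    std::random_device device;
    std::mt19937 engine{device()};

    // Step 2: the engine can now produce as many pseudorandom bits as we
    // like, 32 bits at a time, without drawing on our entropy any further.
    for (int i = 0; i < 5; ++i)
    {
        std::cout << engine() << std::endl;
    }

    return 0;
}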

There is one more problem we need to solve, though. In practice, we generally don't want just a long sequence of bits; instead, we want a sequence of values that has additional properties. For example, we might want:

  - Integers within a given range (say, 1 through 6 to simulate rolls of a die), with each value in the range equally likely.
  - Real numbers between 0 and 1, spread uniformly across that interval.
  - Values that cluster around an average, following a normal ("bell curve") distribution.

So when we're solving actual problems that involve randomness, what we really want is a distribution of random values that satisfies our actual needs. And that's actually tricky to get right; taking a sequence of random bits and turning it into a sequence of the kinds of values described above, if done improperly, will yield results that are biased or just plain incorrect.
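A classic example of getting this wrong is the "modulo trick". Suppose our generator hands us uniformly distributed 32-bit values (the hypothetical bits parameter below) and we want die rolls from 1 to 6; the obvious one-liner is subtly biased, because 2^32 is not evenly divisible by 6:

#include <cstdint>

// BIASED: maps a uniformly distributed 32-bit value onto a die roll from
// 1 to 6 using the modulo operator.  Since 2^32 = 6 * 715827882 + 4, the
// rolls 1 through 4 each arise from 715827883 of the possible inputs,
// while 5 and 6 each arise from only 715827882 of them, so 1 through 4
// come up slightly more often.  The bias is tiny for six outcomes, but it
// grows with the size of the desired range, and it never disappears.
int biasedDieRoll(std::uint32_t bits)
{
    return static_cast<int>(bits % 6) + 1;
}

A correct conversion has to account for the uneven leftover, for example by discarding inputs from the tail of the range and asking for fresh bits; this is the kind of detail a well-written distribution handles for us.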

Ideally, we'd have a function we could call for this purpose, one that would take a sequence of random bits and turn it into the distribution of random values that we actually need.


Putting it all together in C++11 and later

All of these concepts that we've been talking about describe the various parts of the <random> library that was added in the C++11 standard. Once you know how these pieces fit together, there are only a few minor details left to get right. In C++11:

  - Sources of entropy are represented by the std::random_device class, whose objects can be called like functions to produce hard-to-predict values.
  - Pseudorandom generators are called engines; std::default_random_engine is an implementation-selected default, and specific algorithms are also available by name, such as std::mt19937 for the Mersenne Twister.
  - Distributions, such as std::uniform_int_distribution, convert the raw bits produced by an engine into a sequence of values with the properties we actually want.

So, if you want to generate a pseudorandom sequence of rolls of a single six-sided die, you could do something like this.

#include <iostream>
#include <random>

...

// Creates an object that lets us tap into our source of entropy.  Note
// that we only want to do this once and then use it sparingly.  Once created,
// the device can be called like a function.
std::random_device device;

// Now that we want to generate a pseudorandom sequence of bits, we seed a
// random engine using our random_device.  While we wouldn't want to use the
// random_device over and over again, we can use it to seed a pseudorandom
// generator, and then let the algorithm's properties of seeming randomness
// take over from there.
std::default_random_engine engine{device()};

// Finally, we need a distribution, so we can specify what kinds of values we
// actually want.  A uniform_int_distribution is one that generates integer
// values between a given minimum and maximum (inclusive), so, for example,
// the one below generates values between 1 and 6 (inclusive), with each of
// those possible values being equally likely.
std::uniform_int_distribution<int> distribution{1, 6};

// Now that we have all of the pieces set up, we're ready to generate our
// sequence of die rolls.  Notice that the distribution can also be called
// like a function whenever we want our next value, and that we pass the
// engine as a parameter.  When the distribution needs more pseudorandom
// bits, it asks the engine for them, so we don't have to worry about that
// part of it; we simply say "Give us numbers between 1 and 6" and we'll
// get a non-biased, uniformly-distributed sequence of numbers.
for (int i = 0; i < 1000; ++i)
{
    std::cout << distribution(engine) << " ";
}

std::cout << std::endl;

Remember what's really going on here; this isn't something you just want to copy and paste without understanding what it does. When you're solving problems that involve randomness, be sure you're clear on what each of these parts actually does. If, for example, you did all of this — create a random device, then an engine, then a distribution — every time you generated a new value, then you'd essentially be using the random device every time, and you'd quickly run out of meaningful entropy.
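One common way to keep the pieces straight is to create the engine once and pass it by reference to whatever code needs random values. The helper below is a hypothetical sketch of that pattern, not something from the library:

#include <random>

// A hypothetical helper illustrating the pattern: the caller creates the
// engine once (seeded from a std::random_device, as above) and passes it
// by reference, so every call advances the same pseudorandom sequence
// rather than re-seeding a brand new one.  Distributions, unlike engines,
// are cheap to construct, so creating one per call costs little.
int rollDie(std::default_random_engine& engine)
{
    std::uniform_int_distribution<int> distribution{1, 6};
    return distribution(engine);
}

With that in place, a caller would set up the device and engine once and then call rollDie(engine) as many times as it needs.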


Additional, in-depth information

For a fuller explanation of the C++11 library for generating random sequences of values, you can also check out the original paper that describes it:

    Walter E. Brown, Random Number Generation in C++11: A Walkthrough (WG21 paper N3551)
    https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3551.pdf