Fundamentals of Memory and Memory Management

Memory Hierarchies: Speed/Size/Cost Tradeoffs

To start, we will discuss memory hierarchies, including the tradeoffs among
their speeds, sizes, and costs. First, here are the standard prefixes for
sizes and speeds, with which every computer scientist should be familiar.

  kilo  x 10^3        milli  x 10^-3
  mega  x 10^6        micro  x 10^-6
  giga  x 10^9        nano   x 10^-9
  tera  x 10^12       pico   x 10^-12
  peta  x 10^15       femto  x 10^-15
  exa   x 10^18       atto   x 10^-18

So, 1 gigabyte is 10^9 bytes. The following might become relevant during your
lifetime:

  zetta x 10^21       zepto  x 10^-21
  yotta x 10^24       yocto  x 10^-24

There are three standard locations for memory: on the CPU chip (also known as
cache memory), in special memory chips (main memory), and external memory
(disk drives, DVDs, USB drives, etc.). Each is slower than the previous one,
but its size is bigger and its cost/byte is lower. CPU chip memory acts as a
cache for main memory, and main memory often acts as a cache for external
memory (some external memory devices, like disk drives, also have their own
dedicated caches).

A book by Van Loan/Fan (Insight Through Computing) makes some of these
numbers concrete.

  1 megabyte: a 500-page novel; 1 minute of MP3 music
  1 gigabyte: the human genome; 20 minutes of a DVD
  1 terabyte: a university library; photos of all US airline passengers (1 day)
  1 petabyte: the amount of text in the Library of Congress; he says that
              printing it would take 50x10^6 trees

------------------------------------------------------------------------------
External Memory/Disk Drives:

Disk drives/DVDs are modeled by a spinning platter that stores data on
concentric circles; a "read head" can move between the circles. To read a
specific word of data, the read head seeks to the correct circle and waits
for the data to rotate into position under it. A typical rotation speed is
7,200 rpm, which is 120 revolutions/second, so one full rotation takes about
8 milliseconds (we will use 10 milliseconds as a round number); the time
spent waiting for the data to rotate under the head is called the "rotational
delay" (at worst, one full rotation). The read head takes about the same
amount of time to move into position over the correct circle (called the
"seek time"). Note that a processor executing 1 billion operations per second
can execute 10 million instructions in 10 milliseconds, while the read head
seeks and the platter rotates into the needed position, so it can do a lot of
work between setting up for the read and actually receiving the data. We will
look at algorithms that measure their efficiency not by the number of
instructions executed, but by the number of times they must seek data on an
external memory device, because the time spent waiting for data will dominate
the time spent executing instructions.

We often use the terms "latency" and "bandwidth" when discussing memory
access (and transmitting information over networks as well). Latency is the
time from a request for data until the first information arrives. Bandwidth
is the throughput (data rate) starting when the first data begins to arrive.
The latency for a disk drive/DVD can be large (see the 10 millisecond numbers
above), because it involves moving something physical: the read head and the
platter. But once the read head and platter are in the right position, with
the right data under the head, the quickly spinning platter (storing data
densely) can transfer all the data on the circle very quickly. Historically,
data transfer rates on standard hard drives are ~70 megabytes per second.
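To make these latency and bandwidth numbers concrete, here is a small
back-of-the-envelope sketch in C++; the seek time, rotational delay, transfer
rate, and instruction rate are assumed round numbers in the spirit of those
above, not measurements of any particular drive.

    #include <iostream>

    int main() {
        // Assumed round numbers for a hypothetical disk drive and CPU
        double seek_ms     = 10.0;    // move the read head to the correct circle
        double rotation_ms = 10.0;    // wait for the data to rotate under the head
        double mb_per_sec  = 100.0;   // transfer rate once data starts arriving
        double ops_per_sec = 1e9;     // instructions the CPU can execute per second

        double latency_ms  = seek_ms + rotation_ms;       // time before the first byte
        double transfer_ms = 1.0 / mb_per_sec * 1000.0;   // time to transfer a 1 MB block

        std::cout << "latency  = " << latency_ms << " ms, during which the CPU could execute "
                  << static_cast<long long>(latency_ms / 1000.0 * ops_per_sec)
                  << " instructions\n";
        std::cout << "transfer = " << transfer_ms << " ms for a 1 megabyte block\n";
    }

Run as written, this prints a 20 millisecond latency (about 20 million
instructions' worth of waiting) and a 10 millisecond transfer time for a
1 megabyte block, which is why the notes above and below measure efficiency
by the number of disk accesses and argue for transferring big blocks.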
Because external memory typically has high latency (a long time, measured in
machine instructions that could execute before the first piece of data
arrives) and high bandwidth (after that, lots of data can be delivered
quickly), we typically use every memory request to transfer a BLOCK of
memory, not a single word of memory. The expectation is that the subsequent
information will be needed soon (certainly the case when reading a sequential
file). So, when reading information from a file stored on a disk drive,
instead of just reading one character at a time, many characters (tens of
thousands to millions) are read and cached in main memory (or in the disk
drive's dedicated cache; these caches can transfer data to memory at a rate
of about 3 gigabytes per second, much faster than information can be read as
the disk rotates). So, for many subsequent character reads, the information
is retrieved directly from memory and the computer doesn't have to perform
any more disk operations. Such a cache has a special name, a "buffer" (a
small sketch of one appears at the end of this subsection). When the buffer
is exhausted (all its characters have been read), the next character read
initiates another block transfer.

As we saw, it might take 10 milliseconds to move the read head and wait for
the data to rotate under it, but it takes little additional time to read tens
of thousands to millions more characters. Another way to look at this: to get
one character takes 10 milliseconds, but to get 100,000 characters requires
only 10 + 1 = 11 milliseconds (at 100 megabytes/sec: a round number a bit
higher than the 70 quoted above as the transfer rate). So the amortized cost
of reading one of the 100,000 characters is 11/100,000 milliseconds/character,
or about .11 microseconds/character (an overall rate of almost 10 million
characters/second). If we read 1,000,000 characters into the buffer, the
amortized cost would be 20/1,000,000 milliseconds/character, or about .02
microseconds/character (an overall rate of 50 million characters/second).

This analysis is a bit like the one we did for putting N values into an
ArrayQueue, which doubles its size when full. Most adds are very quick, but a
few require much more time, when the array size doubles. Yet if you look at
lots of adds, the average cost is very cheap. Likewise, when reading blocks
of characters, reading each new block takes a lot of time, but most
characters will then be in the memory buffer for quick reading.

Here are some current sizes and relative speeds for CPU chip, main, and
external memory (these are very approximate).

  CPU chip   ~1-10 MB    10-100 times faster than main memory
  Main       ~1-10 GB    100K-1M times faster than external memory (latency)
  External   ~1-? TB

As a rule of thumb, each step down this table (CPU chip to main to external)
increases the size by a factor of 1,000 and decreases the speed and the
cost/byte by a similar factor.

When we analyzed algorithms earlier in this course, we assumed that all data
is in main memory. In fact, often most of the data is in the CPU chip memory,
and its performance is often an important practical consideration for
determining the time an algorithm will take. With effective caching, often
9 of 10 memory accesses will occur in the CPU chip memory, not the main
memory. This can speed up the execution by a factor of 10-100. We will
briefly discuss the interplay between CPU chip and main memory below, using
CPU chip memory to cache information from main memory.
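Returning to the buffer idea above, here is a minimal sketch of a buffered
character reader in C++. The 64 KB buffer size and the class/function names
are assumptions made for this example; real I/O libraries (and the operating
system, and the drive itself) already buffer like this, so the point is only
to show why most requests for the next character never touch the disk.

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // A toy buffered reader: most calls to next_char() are answered from the
    // in-memory buffer; only when the buffer is exhausted do we go back to
    // the slow, high-latency external device for another large block.
    class BufferedReader {
    public:
        explicit BufferedReader(const std::string& file_name)
            : in(file_name, std::ios::binary), buffer(64*1024), pos(0), filled(0) {}

        // Returns false at end of file; otherwise sets c to the next character.
        bool next_char(char& c) {
            if (pos == filled) {                        // buffer exhausted:
                in.read(buffer.data(), buffer.size());  //   one block transfer pays
                filled = static_cast<std::size_t>(in.gcount()); // the latency here
                pos = 0;
                if (filled == 0)
                    return false;
            }
            c = buffer[pos++];   // the common, cheap case: no disk operation at all
            return true;
        }

    private:
        std::ifstream in;
        std::vector<char> buffer;
        std::size_t pos, filled;
    };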
------------------------------------------------------------------------------
Data Access Patterns and the CPU Cache:

Access patterns for data often exhibit two kinds of locality.

Temporal locality: if data is being accessed now, it is likely to be accessed
again soon; for example, a loop index (or more generally, a cursor) is
accessed frequently during the execution of a loop (its value is initialized,
checked, and updated in each loop iteration).

Spatial locality: if data is being accessed now, data near it is likely to be
accessed soon; for example, if we are accessing an ARRAY at position i, it is
likely we will access position i+1 in the near future (when scanning an
array, not when doing a binary search in an array). The effect is similar,
but not quite as pronounced, in linked lists/trees whose nodes were allocated
at about the same time (and are therefore initially in memory locations that
are close together).

Scanning all the values in an array, using an index variable, exhibits both
temporal (the index variable) and spatial (the array elements) locality. If
we can rewrite our program (or write it in machine code) so that it fits
completely in CPU chip memory, it can run much faster than a slightly bigger
amount of code that cannot fit in CPU chip memory.

The following algorithm is implemented in hardware. It is used whenever the
CPU needs to access data. Here we use the term cache for the CPU chip memory.
The cache starts out empty.

  1) If the data is already in the cache, use it.
  2) If the data is not already in the cache:
     a) Retrieve it (and other data near it: some block of memory). As with
        external memory, there is high latency to get the data but high
        bandwidth to transfer a block of data from main memory to the cache,
        so the cache accesses/transfers blocks of data.
     b) If the cache is not full, add the new block of data to it.
     c) If the cache is full, determine which block of data to remove and add
        the new block of data in its place.

In the future, all memory addresses in the cache can be accessed quickly; an
address outside the cache must go through the process above.

We need a policy dictating which old data block to remove from a filled
cache. Three standard and well-studied policies are Random, First In First
Out (FIFO), and Least Recently Used (LRU). Any policy must be fairly simple;
otherwise it could not be implemented directly in hardware (because of the
speeds needed, cache replacement algorithms must be implemented in hardware).
The idea is to leave in the cache any data that is expected to be accessed
soon.

Random does not bring anything relevant into the decision, but it is
simple/cheap to implement (no extra storage). FIFO seems a reasonable
strategy: if something was brought in a long time ago, it is less likely to
be used than something that was brought in more recently (it requires only a
simple queue to keep track of which block to replace next). But LRU gets more
to the heart of temporal locality: if something has been used recently, it is
more likely to be used in the near future (regardless of when it was first
brought in, which is what FIFO monitors). While LRU is harder to implement in
hardware (it uses something like a priority queue), it can be implemented
there, and it is a better predictor of what to remove and what to leave in
the cache.
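Real caches implement their replacement policy in hardware; the following is
only a software sketch of the LRU bookkeeping, for a hypothetical cache that
tracks a fixed number of block addresses, to make the "move to the front on
every use, evict from the back" idea concrete.

    #include <cstddef>
    #include <list>
    #include <unordered_map>

    // A toy LRU cache: the list holds block addresses from most recently used
    // (front) to least recently used (back); the map locates a block's
    // position in that list in O(1) expected time.
    class LRUCache {
    public:
        explicit LRUCache(std::size_t capacity) : capacity(capacity) {}

        // Record an access to block 'address'; returns true on a cache hit.
        bool access(int address) {
            auto it = where.find(address);
            if (it != where.end()) {                    // hit: move to the front
                order.splice(order.begin(), order, it->second);
                return true;
            }
            if (order.size() == capacity) {             // full: evict the LRU block
                where.erase(order.back());
                order.pop_back();
            }
            order.push_front(address);                  // miss: bring the block in
            where[address] = order.begin();
            return false;
        }

    private:
        std::size_t capacity;
        std::list<int> order;                           // MRU at front, LRU at back
        std::unordered_map<int, std::list<int>::iterator> where;
    };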
Because the time needed to locate and transfer a block of data is large
(while waiting for the data to arrive, the computer could execute many
instructions), choosing a replacement policy like LRU, which is less
efficient (it takes more time to determine which block to remove) but better
(it more accurately determines which block won't be used in the near future),
is likely to provide better performance overall.

The concept of "prefetching" works for main->CPU or external->main memory. If
as programmers we know that some data is expected to be used in the near
future (but is not needed yet), we can prefetch it (touch it so that it loads
into the cache). Then, while the computer is doing other things, before the
data is actually needed, it will be brought from the slower to the faster
memory.

The maintenance of caches is an important part of chip design. Cache design
becomes even more interesting when multiple cores/processors access the same
memory. For example, if 4 CPUs each cache some main memory that they share,
and one CPU changes a value (in its cache), the other CPUs caching that same
memory have to be updated (and main memory as well, eventually). When new
cache mechanisms are proposed, they are often evaluated using previously
collected "memory traces" showing which memory locations actual "important"
programs generate, and determining how well the caching mechanism works in
those cases. Such memory traces can comprise billions or trillions (even
more) of memory references from a running program; recall that machines can
execute billions of operations per second.

Likewise, using the concept of "virtual memory", we can make main memory act
as a cache for external memory. Using virtual memory, we can attack huge
problems by just pretending that the computer's memory is as large as its
disk drive (terabytes, not gigabytes). Then we use main memory as a cache for
external memory (just as we discussed above using a CPU chip's memory as a
cache for main memory, including replacement policies). Using virtual memory
we can "easily" solve problems that do not fit into a computer's main memory,
but unless the data structures, and the algorithms processing them, exhibit
strong temporal/spatial locality, the run time can be enormously larger
(thousands to millions of times longer, as data is shuttled between main and
external memory). In the next two lectures we will discuss data structures
and algorithms for fast searching and sorting when using huge amounts of
external memory.

Gordon Bell (a famous computer designer) has written a book called "Total
Recall: How the E-Memory Revolution Will Change Everything" (Dutton, 2009).
In the book he posits that in the future, everyone can have everything that
they ever see and hear (e.g., every conversation that they have), every
web page that they look at, etc. stored in memory and indexed for retrieval.
Here is a quote from page 9, early in the book.

  In fact, digital storage capacity is increasing faster than our ability to
  pull information back out. Once upon a time, you had to be extremely
  judicious and stingy about which pieces of data you hung on to. You had to
  be thrifty with your electronic pieces of information, or bits, as we call
  them. But starting around 2000 it became trivial and cheap to sock away
  tremendous piles of data. The hard part is no longer deciding what to hold
  on to, but how to efficiently organize it, sort it, access it, and find
  patterns and meaning in it.
  This is a primary challenge for the engineers developing the software that
  will fully unleash the power of Total Recall.

Basically, Moore's Law (http://www.intel.com/technology/mooreslaw/),
postulated by Gordon Moore (Intel), says that the number of transistors in a
given area will double about every 18 months. This historically translated
into computer speed doubling as well, but no longer: it requires too much
power to keep speeding up individual processors. So instead, we use the extra
transistors to create more cache memory and more cores (CPUs) on a single
chip. They all run at a "slow" speed, but if programmed correctly, to work
together, they can accomplish as much as a faster chip. How to coordinate
cores is still a problem (some say the biggest practical problem facing
computing today). External memory is still growing at a slightly faster pace
than predicted by Moore's Law; typically every couple of years you can buy
twice the amount of external memory for about the same price (with no speed
degradation, but also not a lot of speed improvement). Solid state external
memory is gaining (in cost and performance) on hard disk drives.

------------------------------------------------------------------------------
Stacks and Heaps:

Most programming languages use memory in two special ways: as a stack and as
a heap (NOT the same kind of binary heap used for efficient priority queues;
here the same name is used for something very different). Main memory is
really just a giant array of words. A 32-bit word stores an int or a
reference; it can also be divided into four 8-bit bytes, where each byte can
store a single ASCII character. Think of all available memory (once the
program has been stored in memory) as being divided between the stack and the
heap, with the stack on the left growing towards the right, and the heap on
the right growing towards the left.

  Memory
  +---------+------------------------------+
  | program | Stack ->             <- Heap |
  +---------+------------------------------+

We have seen that stacks are used for method calls (including recursive
method calls) to store parameters and local variables; stacks are also used
to evaluate arithmetic expressions. Stacks grow and shrink with no "holes":
each method call increases the stack size (adds to it) by N locations
(storing N parameters and local variables) and each method return decreases
the stack (removes from it) by the same N locations.

We use the heap for objects constructed by "new". Heaps can have holes. For
example, if we store a Set in an array, initially we allocate an array of a
certain size to store the Set; later we might double the length of that
array, allocating another array whose size is twice as big as the first
(coming from heap space to the left of the original array). Now the original
array is garbage (there will be no references from the program pointing to
it), creating a hole in the heap space (which can be reused if we delete [] it;
also see the section on garbage collection below). Thus, it is more difficult
for programming languages to manage (allocate and reuse garbage) heap space.
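Here is a small C++ illustration of the two kinds of memory (the function and
variable names are made up for this example): parameters and local variables
live on the stack and disappear automatically when a function returns, while
anything created with "new" lives in the heap until it is deleted (or, in a
garbage-collected language, until the collector reclaims it).

    #include <iostream>

    // Double the array's capacity; the old array becomes a hole in the heap.
    void grow(int*& a, int& capacity) {
        int* bigger = new int[2*capacity];   // new, larger array in the heap
        for (int i = 0; i < capacity; ++i)
            bigger[i] = a[i];
        delete[] a;                          // old array becomes reusable free space;
        a = bigger;                          //   without this delete[] it would be garbage
        capacity *= 2;
    }

    int main() {
        int capacity = 4;                    // capacity and a itself live on the stack
        int* a = new int[capacity];          // the array's elements live in the heap
        for (int i = 0; i < 10; ++i) {
            if (i == capacity)
                grow(a, capacity);           // doubling, as in the Set example above
            a[i] = i*i;
        }
        std::cout << "capacity is now " << capacity << "\n";
        delete[] a;
    }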
------------------------------------------------------------------------------
Basics of (Heap) Memory Management:

Whether programmers do their own memory management (as in the C and C++
languages, where they must explicitly "delete" memory they no longer need) or
an automatic garbage collector (really, a "recycler") does it for us, we can
discuss various needs and strategies for recycling memory.

First, a free block of memory is a contiguous number of free memory
locations. Typically each memory block allocated in the heap has a few words
reserved for memory management information. A minimal amount would allow us
to store the address and size of each free block of memory and a reference to
the next free block of memory (keeping all the free blocks in a linear linked
list). Initially we would have a list with one huge block of free memory.
When a block of memory is freed (either explicitly because we delete it, or
because an automatic garbage collector, in languages like Python and Java,
finds it), we can add it to the linked list of free memory blocks. If we need
to allocate a block of memory, before going to the remaining memory in the
heap (or after going there and not finding enough memory), we can check
whether we can reallocate a block of memory from this linked list of free
memory blocks that were previously allocated but then freed/collected.

We will discuss four strategies below, using the concept of "fragmentation".
Memory is fragmented if there are many small free blocks (as opposed to a few
large free blocks). If memory is fragmented, it is likely to take longer to
search for a free block of the necessary size. Here are four policies that
decide which memory block to use from the linked list of free memory blocks.

  1) First-fit: search the linked list starting at the beginning and stop at
     the first memory block with enough space.
  2) Next-fit: search the linked list starting wherever the last reclaimed
     block came from, and stop at the first block with enough space (if we
     run off the end of the list, start over at the front: e.g., a circular
     list).
  3) Best-fit: search the entire linked list and find the smallest block with
     enough space (or keep the list sorted by size, or use a hash-like
     structure with all blocks of 1-2 words, 3-4 words, 5-8 words, 9-16
     words, 17-32 words, etc. linked together: bin 30 would hold memory
     blocks whose size is about 2^30, a gigabyte).
  4) Largest: use the largest free memory block (sometimes called
     "worst fit", but not pejoratively).

After allocating a memory block of the needed size, the remaining memory in
that block goes back on the linked list of free memory blocks (with a smaller
size). So, regardless of the policy, if we need 100 words of memory to
allocate for an object, and we decide to use a block that stores 300 words of
memory, we allocate the 100 and put the remaining 200 back into free memory.

  1) First-fit: initially fast, but can create lots of small memory blocks at
     the front of the linked list, slowing down searching.
  2) Next-fit: improves on first-fit by spreading fragmentation throughout
     the linked list (not always at the front).
  3) Best-fit: wastes little extra space, but tends to create very small
     memory blocks, possibly unallocatable (because they are too small to be
     useful), at the front of the linked list; it must also search a lot.
  4) Largest: can be fast (we can use a priority queue where the largest
     memory block has the highest priority), and the leftover blocks it puts
     back in the priority queue are large (and thus more easily allocatable
     in the future).

Computer scientists have created lots of models for managing recycled memory
and collected lots of data (in the form of memory-use traces) to simulate and
evaluate all sorts of memory recycling policies.
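Here is a minimal sketch of the first-fit policy described above, over a
singly linked list of free blocks. The node layout is a simplified assumption
(in a real allocator the header words live inside the free block itself), and
splitting the chosen block and returning the leftover to the list is omitted.

    #include <cstddef>

    struct FreeBlock {
        std::size_t size;      // number of free words in this block
        FreeBlock*  next;      // next block on the free list
    };

    // First-fit: walk the list from the beginning and take the first block
    // that is big enough; return nullptr if none is (the caller would then
    // ask the heap for more memory, or trigger garbage collection).
    FreeBlock* first_fit(FreeBlock*& head, std::size_t needed) {
        FreeBlock* prev = nullptr;
        for (FreeBlock* cur = head; cur != nullptr; prev = cur, cur = cur->next)
            if (cur->size >= needed) {
                if (prev == nullptr)           // unlink the chosen block
                    head = cur->next;
                else
                    prev->next = cur->next;
                return cur;
            }
        return nullptr;
    }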
Note that when a block of memory is deleted, it is a good idea (though it
takes a bit of time) to discover whether it is adjacent to another free
block, and if so, to combine the two blocks into one bigger block.

Technically, once you delete a block you should never access any of its
information. Some delete operations will overwrite all the data with a
special bit pattern (like all 0s) to ensure that if you try to access that
data in the future you will see a strange result. Others don't take the time
to do that, because good programmers should never access that information :)
I believe the PC's delete stores the strange bit patterns but the Mac's does
not. When students access information in deleted blocks on the PC, their
programs often crash, but on the Mac (so long as the block has not been
recycled) the code will appear to work. This causes problems when we grade
Mac-written solutions on the PC: they often fail on the PC but run correctly
(although the programmer is doing something wrong) on the Mac.

------------------------------------------------------------------------------
Garbage Collection:

When programmers manage free memory themselves, their code is often prone to
error (even for good/experienced programmers), creating memory leaks: memory
that the program is not using (and no longer has access to) but that is also
not on a free list for future use: truly garbage that cannot be recycled.
Other times programmers delete memory (which might then be recycled) while
other parts of the program are still using it. Smart pointers in C++ are used
to minimize these kinds of mistakes. Sometimes leaky programs must be stopped
and restarted because they run out of memory. There are some mission-critical
programs that do not allow the use of "new" (after setting up their initial
data structures) because of possible memory leaks.

During the first Gulf War, a memory leak was found in an anti-missile weapon
which would not function well after operating for days (it was designed to be
used in a "fast European war" and was not expected to work continuously for
days at a time). Until the software was fixed, the operators were instructed
to shut down and restart the software every few days (of course, when to shut
it down was problematic, as the system was inoperable during the minutes
required for a shutdown and restart). As I said above, it was designed to
operate in Europe, where anti-missile batteries would go on alert for just
hours at a time, so testing it under those conditions failed to show any
problems with running it 24/7 (as was needed during the first Gulf War).

By using an automatic Garbage Collector (GC) we avoid explicit deallocation:
our code calls "new" when it needs memory but NOT "delete" (typically the
code just makes some variable refer to a different object; the original
object it referred to, if no other variables refer to it, becomes
garbage/recyclable). Such systems can find all the memory blocks not
currently used by a program and put them all on the linked list of free
memory blocks. Note that languages like Lisp had automatic garbage collection
as early as the 1960s.
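Here is a short illustration of the smart-pointer idea mentioned above:
std::shared_ptr keeps a reference count for us, so an object is deleted
exactly when the last pointer to it goes away, with no explicit delete to
forget. The Node type is invented for this example; note that the
circular-structure problem discussed below applies to reference counts like
these as well.

    #include <memory>
    #include <string>

    struct Node {
        std::string value;
        std::shared_ptr<Node> next;    // copying this pointer bumps a reference count
    };

    int main() {
        auto a = std::make_shared<Node>();  // reference count for a's node: 1
        a->value = "first";
        {
            auto also_a = a;                // count: 2
        }                                   // also_a is gone: count back to 1
        a = nullptr;                        // count: 0 -> the node is deleted automatically
        // No explicit delete anywhere; but a cycle (a node whose chain of next
        // pointers leads back to itself) would keep every count above 0, and
        // reference counting alone would never free those nodes.
    }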
Note too that for a program that doesn't exhaust memory, automatic garbage
collection (which never actually runs in such a program) can be faster than
manual memory management: manual management does some work on every disposal,
while automatic garbage collection does no work on disposal, at the cost of
possibly more work recycling garbage when there is no more free memory in the
heap. If an application can run to completion without ever triggering
garbage collection, it can take less time overall.

Simple garbage collection can be accomplished by storing a reference count
with each piece of data, so we know that when a count goes to 0 the data can
be recycled. However, a circular structure can have every one of its nodes
with a non-zero reference count even though no variable refers to any part of
the circular structure.

Mark and Sweep garbage collectors are fairly standard and simple to
understand, but there are many different algorithms for this universally
useful task. We will discuss some briefly.

In the Mark phase, the GC first finds all the pointers in a program:
initially, these are all the pointer variables stored on the stack that refer
to objects in the heap (from global variables, and from parameters and local
variables in executing functions/methods). The GC follows these references to
the objects they refer to and marks those objects as "live" (often there is a
bit in an extra word associated with each block of memory to mark whether or
not it is live). From these live objects, the GC follows their pointer
instance variables to the objects that they refer to in the heap, and marks
those objects live as well. The GC continues this process (which is like
searching a graph of objects, which point to other objects, for
"reachability") until it has marked live every object that can be reached
from the parameters/local variables active in the code. This is like the
"reachable" computation from Programming Assignment #1. There are some very
clever algorithms that use the extra space in these live objects to store the
data structures needed to reach all the live objects, so we don't need much
extra memory during garbage collection (because at that time we don't have
much extra memory!).

In the Sweep phase, the GC sweeps through the heap memory and puts on the
linked list of free memory blocks every memory block it encounters that is
NOT marked as live. If possible, it will combine two adjacent free memory
blocks into one larger one. More sophisticated "compacting" GC algorithms can
also change live pointers so that their data sits at the far right of the
heap space, with all free heap space to the left (as in the diagram above).

Finally, note that in Java (or any automatic GC language), when we are
storing pointer data in an array (say, for a Set) and we perform a "clear"
operation, typically we set used to 0 AND store null (nullptr) in every
previously used location in the array. This ensures that objects whose only
references are the pointers stored in the array can be garbage collected. Why
do we have to set everything to null and not just set used to 0? Even though
WE know that no array positions store useful data when used is 0, the garbage
collector treats every pointer in the array as live during the Mark phase; if
we left those pointers stored in the array, the objects they refer to would
never be garbage collected. A small sketch appears below.
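Here is the "clear" idea as a small C++ sketch, using reference-counted
pointers to stand in for Java references (the ArraySet name and its fields
are made up for this example, and doubling is omitted): just setting used to
0 would leave counted references sitting in the unused slots, so the elements
could never be reclaimed.

    #include <cstddef>
    #include <memory>
    #include <vector>

    template<typename T>
    class ArraySet {                        // hypothetical array-based set
    public:
        explicit ArraySet(std::size_t capacity) : data(capacity), used(0) {}

        bool add(std::shared_ptr<T> x) {
            if (used == data.size())
                return false;               // doubling omitted for brevity
            data[used++] = x;
            return true;
        }

        void clear() {
            for (std::size_t i = 0; i < used; ++i)
                data[i] = nullptr;  // drop each reference so its object can be reclaimed
            used = 0;
        }

    private:
        std::vector<std::shared_ptr<T>> data;
        std::size_t used;
    };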
Of course the code works correctly either way, but if we don't set the object
references to null, we may eventually run out of space. So clear becomes O(N)
instead of O(1) when we must remove the pointers.

Garbage collectors have problems as well: they run at unpredictable times and
take an unpredictable amount of time to run (although we can determine limits
on their run time). So, there are some mission-critical real-time programs
that do not allow the use of "new" because of this unpredictability. For
example, in real-time applications (such as software flying an airplane), we
would like to ensure that garbage collection does not take place during a
critical phase (like landing). So, some real-time software prohibits the use
of heap memory.

There are ways around this unpredictability. We can run an "incremental" GC
at the same time as the program, to minimize the frequency and length of
pauses in the actual code. Say, every 100 milliseconds the GC runs for a few
milliseconds, doing some of its work. The result is that the program executes
a few percent slower (typically not a big problem on fast CPUs), but garbage
collection runs more predictably: when a full collection is required, much of
its work has already been accomplished. In fact, with multi-core CPUs, we can
always run a GC on one of the cores to minimize the impact of automatic
garbage collection.

------------------------------------------------------------------------------
A Class Doing its own Memory Management:

A class can do its own memory management. Imagine a linked list of LNs for
some application (say, in a hash table implementing a Map). When we erase an
item, instead of deleting its LN, we could put it on a free list: allocated
LN objects that are no longer needed. Then, when we need to allocate a new
LN, we first look on that free list; if it contains at least one LN, we use
it. If it doesn't, we call "new" to get the LN we need. The destructor for
such a class would ultimately delete all the LNs it was using and all those
on its free list. (A small sketch of this idea appears at the end of these
notes.)

The main problem with this approach is that if we have multiple data
structures and each has its own free list, a data structure needing a new LN
can find only those it is controlling; it cannot easily use LNs on the free
list controlled by another data structure. By using "delete" instead, all
free storage ends up controlled by the same mechanism, so a data structure of
any size can be allocated.

Final Words:

One lecture on the material described in these notes is not enough to get a
truly intuitive feeling for the information. The course ICS 51 (Introductory
Computer Organization) and courses on programming languages and operating
systems cover these topics in much more depth. Read about these terms on the
internet as well (e.g., Wikipedia), using the names provided here.
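As promised above, here is a minimal sketch of a class that manages its own
free list of linked-list nodes. The LN name follows the lectures; the LNPool
name and its exact interface are assumptions made for this example.

    // Linked-list node, as used in the lectures.
    template<typename T>
    struct LN {
        T   value;
        LN* next = nullptr;
    };

    // A hypothetical per-data-structure node recycler.
    template<typename T>
    class LNPool {
    public:
        LN<T>* allocate() {
            if (free_list == nullptr)      // nothing to reuse: ask the heap
                return new LN<T>();
            LN<T>* n = free_list;          // reuse the most recently released node
            free_list = free_list->next;
            n->next = nullptr;
            return n;
        }

        void release(LN<T>* n) {           // called instead of: delete n;
            n->next = free_list;
            free_list = n;
        }

        ~LNPool() {                        // the destructor really deletes them all
            while (free_list != nullptr) {
                LN<T>* n = free_list;
                free_list = free_list->next;
                delete n;
            }
        }

    private:
        LN<T>* free_list = nullptr;
    };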