Lecture 13: Caching

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Wednesday, March 13).

Caching and the Storage Hierarchy

We are now switching gears to talk about one of the most important performance-improving concepts in computer systems: the idea of cache memory.

Why are we covering this?

Caching is an immensely important concept to optimize performance of a computer system. As a software engineer in industry, or as a researcher, you will probably find yourself in countless situations where "add a cache" is the answer to a performance problem. Understanding the idea behind caches, as well as when a cache works well, is important to being able to build high-performance applications.

We will look at specific examples of caches, but a generic definition is the following: a cache is a small amount of fast storage used to speed up access to larger, slower storage.
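To make this definition concrete, here is a minimal sketch of a cache implemented in software (this is not part of the lecture code; all names and sizes are made up): a handful of recently used key/value pairs are kept in a small table, and a lookup only falls back to the slow storage when the table does not already hold the answer.

```c
#include <stdbool.h>
#include <stdio.h>

#define CACHE_SLOTS 8            // the "small amount of fast storage"

struct cache_entry {
    bool valid;                  // does this slot hold a cached value?
    int key;
    int value;
};

static struct cache_entry cache[CACHE_SLOTS];

// Stand-in for the "larger, slower storage", e.g., a disk or network lookup.
static int slow_lookup(int key) {
    // (pretend this takes milliseconds rather than nanoseconds)
    return key * 2;
}

// Look up `key`, answering from the cache when possible.
static int cached_lookup(int key) {
    struct cache_entry* e = &cache[key % CACHE_SLOTS];  // direct-mapped: each key maps to one slot
    if (e->valid && e->key == key) {
        return e->value;                                // hit: served from fast storage
    }
    int value = slow_lookup(key);                       // miss: go to slow storage...
    e->valid = true;                                    // ...and remember the result
    e->key = key;
    e->value = value;
    return value;
}

int main(void) {
    printf("%d\n", cached_lookup(42));   // miss: consults slow_lookup
    printf("%d\n", cached_lookup(42));   // hit: served from the cache
    return 0;
}
```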

One reasonable question is what we actually mean by "fast storage" and "slow storage", and why we need both. Couldn't we just put all of the data on our computer into fast storage?

To answer this question, it helps to look at what different kinds of storage cost and how this cost has changed over time.

The Storage Hierarchy

When we learn about computer science concepts, we often talk about "cost": the time cost and space cost of algorithms, memory efficiency, and storage space. These costs fundamentally shape the kinds of solutions we build. But financial costs also shape the systems we build, and the costs of the storage technologies we rely on have changed dramatically, as have their capacities and speeds.

The table below gives the price per megabyte of different storage technologies over time, in inflation-adjusted dollars. (Note that flash/SSD storage did not exist until the early 2000s, when the technology became commercially available.)

Year  | Memory (DRAM) | Flash/SSD | Hard disk
------|---------------|-----------|-----------
~1955 | $411,000,000  | –         | $9,200
1970  | $734,000.00   | –         | $260.00
1990  | $148.20       | –         | $5.45
2003  | $0.09         | $0.305    | $0.00132
2010  | $0.019        | $0.00244  | $0.000073
2021  | $0.003        | $0.00008  | $0.0000194

(Prices due to John C. McCallum, and inflation data from here. $1.00 in 1955 had "the same purchasing power" as $9.62 in 2019 dollars.)

Computer technology is amazing – not just for what it can do, but also for just how tremendously its cost has dropped over the course of just a few decades. The space required to store a modern smartphone photo (3 MB) on a harddisk would have cost tens of thousands of dollars in the 1950s, but now costs a fraction of a cent.

But one fundamental truth has held across all these numbers: primary memory (DRAM) has always been substantially more expensive than long-term disk storage. This becomes even more evident if we normalize all numbers in the table to the cost of 1 MB of harddisk space in 2021, as the second table below does.

Year  | Memory (DRAM)       | Flash/SSD | Hard disk
------|---------------------|-----------|------------
~1955 | 219,800,000,000,000 | –         | 333,155,000
1970  | 39,250,000,000      | –         | 13,900,000
1990  | 7,925,000           | –         | 291,000
2003  | 4,800               | 16,300    | 70
2010  | 1,000               | 130       | 3.9
2021  | 155                 | 4.12      | 1

As a consequence of this price differential, computers have always had more persistent disk space than primary memory. Harddisks and flash/SSD storage are persistent (i.e., they survive power failures), while DRAM memory is volatile (i.e., its contents are lost when the computer loses power), but harddisk and flash/SSD storage are also much slower to access than memory.

In particular, when thinking about storage performance, we care about the latency to access data in storage. Latency denotes the time until retrieved data is available (for a read), or until the data is on the storage medium (for a write). A longer latency is worse, and a smaller latency is better, as a smaller latency means that the computer can complete operations sooner.

Another important storage performance metric is throughput (or "bandwidth"), which is the number of operations completed per time unit. Throughput is often, though not always, the inverse of latency. An ideal storage medium would have low latency and high throughput: it takes very little time to complete a request, and many units of data can be transferred per second.
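To see how latency limits throughput, consider a rough, illustrative calculation (the specific numbers here are made up for the example, though of realistic orders of magnitude): if a hard disk needs about 10 ms per random access and each access returns a 4 KB block, then random-access throughput is at most 4 KB / 10 ms ≈ 400 KB/sec. A medium with 60 ns access latency, issuing the same accesses one at a time, could reach 4 KB / 60 ns ≈ 68 GB/sec. A large latency gap thus turns into an equally dramatic throughput gap for small, scattered accesses.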

In reality, though, latency generally grows, and throughput drops, as storage media are further and further away from the processor. This is partly due to the storage technologies employed (some, like spinning harddisks, are cheap to manufacture, but slow), and partly due to the inevitable physics of sending information across longer and longer wires.

The table below shows the typical capacity, latency, and throughput achievable with the different storage technologies available in our computers.

Storage type         | Capacity            | Latency | Throughput (random access)     | Throughput (sequential)
---------------------|---------------------|---------|--------------------------------|------------------------
Registers            | ~30 (100s of bytes) | 0.5 ns  | 16 GB/sec (2×10⁹ accesses/sec) | –
DRAM (main memory)   | 8 GB                | 60 ns   | 100 GB/sec                     | –
SSD (stable storage) | 512 GB              | 60 µs   | 550 MB/sec                     | –
Hard disk            | 2–5 TB              | 4–13 ms | 1 MB/sec                       | 200 MB/sec

This notion of larger, cheaper, and slower storage further away from the processor, and smaller, faster, and more expensive storage closer to it is referred to as the storage hierarchy, as it's possible to neatly rank storage according to these criteria. The storage hierarchy is often depicted as a pyramid, where wider (and lower) entries correspond to larger and slower forms of storage.

This picture includes processor caches, which are small regions of fast (SRAM) memory that are on the processor chip itself. The storage hierarchy shows the processor caches divided into multiple levels, with the L1 cache (sometimes pronounced "level-one cache") closer to the processor than the L2, L3, and L4 caches. This reflects how processor caches are actually laid out, but we often think of a processor cache as a single unit.

Different computers have different sizes and access costs for these hierarchy levels; the ones in the table above are typical. Here are some more, based on Malte's MacBook Air from ca. 2013: a few hundred bytes of registers; ~5 MB of processor cache; 8 GB of primary memory; 256 GB of SSD. The processor cache divides into three levels: 128 KB of total L1 cache, split into four 32 KB components, each accessed only by a single processor core (which makes it faster, as cores don't need to coordinate); 256 KB of L2 cache; and 4 MB of L3 cache shared by all cores.

Each layer in the storage hierarchy acts as a cache for the following layer.

Finally, consider how the concept of caching abounds in everyday life, too. Imagine how life would differ without, say, fast access to food storage – if every time you felt hungry, you had to walk to a farm and eat a carrot you pulled out of the dirt. Your whole day would be occupied with finding and eating food! (Indeed, this is what some animals spend most of their time doing.) Instead, your refrigerator (or your dorm's refrigerator) acts as a cache for your neighborhood grocery store, and that grocery store acts as a cache for all the food producers worldwide.

Measuring Actual Cache Performance

We can see the CPU's cache in action by running our ./arrayaccess program under a tool that reads information from special "performance counter" registers in the processor. The perf.sh script invokes this tool and sets it to measure the last-level cache (LLC) accesses ("loads") and misses. In our example, the L3 cache is the last-level cache.

When we run ./arrayaccess -u -i 1 10000000, we should expect a hit rate smaller than 100%, since the first access to each block triggers a cache miss. Sometimes, but not always, prefetching manages to ensure that the next block is already in the cache (which increases the hit rate); the precise numbers depend on a variety of factors, including what else is running on the computer at the same time.
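The actual arrayaccess source is part of the lecture code linked above. Purely as an illustration (the names and details below are guesses, not the real program), its core might look roughly like this: allocate a large array and touch every element once, either in sequential index order or in a pre-shuffled random order.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 10000000;     // 10 million ints (~40 MB), much larger than the L3 cache
    int random_order = 0;    // 0: sequential order (like -u); 1: random order (like -r)

    int* a = calloc(n, sizeof(int));
    size_t* order = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; ++i) {
        order[i] = i;        // start with the identity order 0, 1, 2, ...
    }
    if (random_order) {      // shuffle the access order (Fisher-Yates)
        for (size_t i = n - 1; i > 0; --i) {
            size_t j = (size_t) rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
    }

    // The measured loop: touch every array element once, in the chosen order.
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += a[order[i]];  // sequential order enjoys locality and prefetching; random order mostly misses
    }
    printf("sum = %ld\n", sum);

    free(order);
    free(a);
    return 0;
}
```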

For example, we may get the following output:

$ ./perf.sh ./arrayaccess -u -i 1 10000000
accessing 10000000 integers 1 times in sequential order:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, ...]
OK in 0.007934 sec!

 Performance counter stats for './arrayaccess -u -i 1 10000000':

           307,936      LLC-misses                #   25.67% of all LL-cache hits
         1,199,483      LLC-loads

       0.048342578 seconds time elapsed
This indicates that we experienced a roughly 74% cache hit rate (100% minus the 26% miss rate) – a rather decent result.

If we instead run the arrayaccess program with a random access pattern, the hit rate is much lower and there are more misses:

$ ./perf.sh ./arrayaccess -r -i 1 10000000
accessing 10000000 integers 1 times in random order:
[4281095, 3661082, 3488908, 9060979, 7747793, 8711155, 427716, 9760492, 9886661, 9641421, 9118652, 490027, 3368690, 3890299, 4340420, 7513926, 3770178, 5924221, 4089172, 3455736, ...]
OK in 0.152167 sec!

 Performance counter stats for './arrayaccess -r -i 1 10000000':

        19,854,838      LLC-misses                #   78.03% of all LL-cache hits
        25,443,796      LLC-loads

       0.754197032 seconds time elapsed
Here, the hit rate is only 22%, and 78% of cache accesses result in misses. As a result, the program runs 16x slower than when it accessed memory sequentially and benefited from locality of reference and prefetching.

By varying the size of the array and observing the measured miss rate, we can work out the size of the L3 cache on the computer we're running on: once the array is smaller than the cache size, the hit rate will be nearly 100%, since no cache blocks ever get evicted.

Caching and File I/O

Caching exists at many layers of computer systems!

The programs diskio-slow and diskio-fast in the lecture code illustrate the huge difference caching can make to performance. Both programs write bytes to a file they create (the file is simply called data; you can see it in the lecture code directory after running these programs).

diskio-slow is a program that writes data to the computer's disk (SSD or harddisk) one byte at a time, and ensures that the byte is written to disk immediately and before the operation returns (the O_SYNC flag passed to open ensures this). It can write a few hundred bytes per second – hardly an impressive speed, as writing a single picture (e.g., from your smartphone camera) would take several minutes if done this way!
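A simplified sketch of this idea (not the literal lecture code; error handling and the byte count are made up) might look like the following. Because the file is opened with O_SYNC, each one-byte write() only returns once the byte has actually reached the storage device.

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // O_SYNC: every write() returns only after the data has reached the device.
    int fd = open("data", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0666);
    if (fd < 0) {
        return 1;
    }
    char byte = 'x';
    for (int i = 0; i < 1000; ++i) {
        write(fd, &byte, 1);   // one synchronous byte at a time: very slow
    }
    close(fd);
    return 0;
}
```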

diskio-fast, on the other hand, writes to disk via a series of caches. It easily achieves write throughputs of hundreds of megabytes per second: in fact, it writes 50 MB in about a tenth of a second on my laptop! This happens because these writes don't actually go to the computer's disk immediately. Instead, the program just writes to memory and relies on the operating system to "flush" the data out to stable storage over time in a way that it deems efficient. This improves performance, but it does come with a snag: if my computer loses power before the operating system gets around to putting my data on disk, it may get lost, even though my program was under the impression that the write to the file succeeded.
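A correspondingly simplified sketch of the fast version (again, an illustration rather than the literal lecture code) issues ordinary buffered writes: the bytes land in the stdio library's in-memory buffer and then in the operating system's buffer cache, and only reach the disk when those caches are flushed.

```c
#include <stdio.h>

int main(void) {
    FILE* f = fopen("data", "w");        // buffered: writes go to memory first
    if (!f) {
        return 1;
    }
    char byte = 'x';
    for (int i = 0; i < 50 * 1024 * 1024; ++i) {
        fputc(byte, f);                  // lands in stdio's in-memory buffer, not on disk
    }
    fclose(f);                           // flushes the buffer to the OS; the OS writes to disk later
    return 0;
}
```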

Summary

Today, we started by looking at caching in action and measured the hit and miss rates of real-world processor caches. We also talked about how the concept of caching shows up again between applications and files on disk, where I/O libraries cache the contents of on-disk files in RAM. There is a lot more to say about caches, and as you complete Project 3, you will build your own cache!