Lecture 3: Pointers and Data Representation

🎥 Lecture video (Brown ID required)
💻 Lecture code
❓ Post-Lecture Quiz (due 11:59pm, Monday, February 3).

Pointers and How To Use Them

We previously discussed that memory boxes can store addresses of other memory boxes, and how an address occupies 8 bytes. C provides the ampersand operator (&) to get the address of any variable. In other words, if you have a variable local, writing &local gives you the address of the memory boxes that store local.

Where are you going to store the address value returned from &local? Well, if you want to hold on to it, you'll want to store it in a variable itself. Where does that variable live? It better be in memory, too! In other words, the 8 bytes corresponding to the address of &local will have a memory location of their own. We refer to such memory locations that hold addresses as pointers, because you can think of them as arrows pointing to other memory boxes.

In terms of C types, a type followed by an asterisk corresponds to a pointer. For example, int* is a pointer to an integer. An int* itself occupies 8 bytes of memory (since it stores an address), and it points to the first byte of a 4-byte sequence of memory boxes that store an int.

To actually get to the value that a pointer points to, you use the * (asterisk) operator on the pointer (see datarep/ptr-intro.c in the lecture code):

void f() {
    int local = 1;
    int* ptr = &local;

    // prints 1
    printf("value of *ptr: %d\n", *ptr);  // <== here, we "dereference" the pointer to get to the value of local

    // prints the address of local, twice
    // (%p expects a void*, so we cast both arguments)
    printf("address of local: %p %p\n", (void*) &local, (void*) ptr); // <== "%p" is a printf format for printing pointers!
}
For some people, it helps to think of the asterisk operator as "cancelling" out the asterisk on the type: i.e., *(int*) == int.

Some C programmers like to put the asterisk next to the type (int* ptr for int-pointer ptr), others put it with the variable name (int *ptr), because that way it's clear that the dereferenced version of ptr is an int. Use whatever notation makes sense for you!

To change the value behind a pointer, you must dereference it. For example, the following program changes the value of integer local through the pointer ptr by assigning a value to the dereferenced pointer:

void f() {
    int local = 1;
    int* ptr = &local;

    *ptr           = 42;
// |-------------|   ^
// deref'd pointer   | value we assign
//     = int         | (also an int)

    printf("value of local is now: %d\n", local);  // <== prints 42
}

Pointers and how to use them are very important in this course and in C/C++ programming in general. We'll keep coming back to these concepts. The key things to remember are: & takes the address of an object and makes a pointer, and * dereferences a pointer and follows it to the value it refers to in memory. Types with an asterisk next to them are pointer types.

Strings!

So far, we've looked primarily at numbers. Let's look at strings of characters (mexplore-string.c) instead, since you'll need to work with strings for Lab 1 and Project 1.

The program defines two strings, and their types are set to char[]. char is the name for a byte type in the C language; it refers to the fact that a byte is exactly sufficient to store one character according to the ASCII standard, a way of translating numbers into characters and vice versa. Computers can only store numbers, so all characters in a computer are actually "encoded" as numbers. For example, the uppercase letter "A" in ASCII corresponds to the number 65 (see man ascii for the translation table).

What's ASCII, and do we still use it today?

In the early days of computers, every computer had its own way of encoding letters as numbers. ASCII, the American Standard Code for Information Interchange, was defined in the 1960s to find a common way of representing text. Original ASCII uses 7 bits, and can therefore represent 128 distinct characters: enough for the alphabet, numbers, and some funky special characters (e.g., newlines (\n), the NUL character (\0), and the "bell" character that made typewriter bells go off).
Later 8-bit extensions of ASCII doubled this to 256 characters, but even that isn't sufficient to support languages that use non-Latin alphabets, and certainly not emoji. So, while all of today's computers still support ASCII, we've mostly moved on to a new standard, Unicode, which has room for about 1.1 million characters. Its most common encoding, UTF-8, uses one to four bytes per character. Fun fact: to be backwards compatible, UTF-8 is defined such that the original ASCII character encodings remain valid UTF-8.

A string in C is simply a sequence of bytes, each represented as a char. So, to look at the string in memory, we need to change our call to hexdump() to print the right number of bytes. Our two strings contain 12 and 13 characters, so that's 12 and 13 bytes. And indeed, ./mexplore-string prints those bytes if you pass 12 and 13 to hexdump(). But what if we pass a larger number? Turns out we just get whatever is in memory next to the string. For global_st in our example, that's a bunch of garbage. If we keep going long enough (e.g., 20,000 bytes), the program actually crashes with a Segmentation fault (SEGV). You can probably guess now what that error is about: the program tried to access memory outside of a valid segment, and the OS terminated it to keep things safe.

But how does a program know when a string ends? We may not always know the length at compile time, as we do in these examples. It turns out that there's a bonus NUL (0x0, sometimes written \0) character hanging around at the end of every string. The NUL byte is a terminator that has the special meaning "end-of-string". Every C string must have a NUL byte at its end. (Forgetting the NUL byte is a huge source of bugs in the real world and in your assignments. Watch out for it; this matters for Lab 1 and Project 1!) So, our strings really occupy 13 and 14 bytes in memory, not 12 and 13.

Summary

Today, we looked at the notion of pointers (values storing addresses) in C. Then, we discussed more about where program data lives in memory. We learned about dynamically-allocated memory, which is hugely important for real-world programs that need to keep objects around after the end of a function or create objects whose size is only known at runtime. Finally, we understood how C strings are represented as sequences of bytes terminated with a NUL byte, and how easy it is to accidentally go past their end. You're now in a good place to work on Lab 1, but we'll talk more about memory organization, arrays and pointer arithmetic next time.