Lab 6: Threads - HedgeDoc

<style> details { margin-top: 0.5em; margin-bottom: 0.5em; } summary { //font-weight: bolder; } summary:hover { text-decoration: underline; } blockquote { font-size: 16px; } .todo { color: #ff00ff; border: 2px dashed #ff00ff; padding: 0em 1em; border-radius: 5px; margin-top: 1em; margin-bottom: 1em; //display: none; // UNCOMMENT TO HIDE TODOs } </style> **[« Back to the main CS 300 website](https://csci0300.github.io)** # Lab 6: Threads **Due:** Tuesday, April 22 at 8:00 PM EST --- # Introduction In Lab 5, you implemented a concurrent program using multiple processes. With multiple processes, each process gets its own distinct address space. By default, no data is shared, and in Lab 5, you dealt with this by using pipes to send data between processes. (Another approach is to designate a shared region of memory with the `mmap` system call.) However, sometimes you want concurrency within a process, with the different parallel executions truly sharing memory. In this lab, you will gain experience creating concurrent programs using multiple threads. Using multiple threads rather than processes is beneficial as threads of the same process share an address space. The only private region of memory for threads is their stack, so any data contained in the heap or global sections are shared across all threads of the same processes, and their changes are visible to each other. **It is important to note the difference between a process and a thread.** A *process* is the general container responsible for the execution of a program. *Threads* are the individual entities within a process which actually do the execution. A single process can contain multiple threads all working in its address space. Every process contains at least one thread, the "main" thread, whose execution starts in `main`! --- # Assignment ## Assignment Installation First, open a terminal in your container environment and `cd` into your `labs` folder, then ensure that your repository has a `handout` remote. To do this, type: ```bash $ git remote show handout ``` If this reports an error, run: ```bash $ git remote add handout https://github.com/csci0300/cs300-s25-labs.git ``` Then run: ```bash $ git pull $ git pull handout main ``` This will merge our Lab 6 stencil code with your previous work. :::warning **You may get a warning about "divergent branches"** like shown below: ![](https://i.imgur.com/BeYsMKC.png) This means you need to set your Git configuration to merge our code by default. To do so, run (within the same directory in your labs repository workspace): ```console $ git config pull.rebase false $ git pull handout main ``` ::: If you have any "conflicts" from Lab 5 (although this is unlikely), resolve them before continuing further. Run `git push` to save your work back to your personal repository. ## C++ Tips Like Lab 5, this lab uses C++, and includes some new syntax that may be different from the C++ you've seen so far. If you skipped lab 5, please take a look at the following important C++ tips: <details><summary>Expand for details</summary> **Arrays:** To allocate memory for C++ arrays on the heap, you should use the `new` keyword. To free that memory, you would use the `delete` keyword. This is shown in the next section, which describes the function of the `new` and `delete` keyword. **New/Delete:** C++ lets you allocate heap memory via the new operator, which takes the name of a class (or struct) to instantiate an object in heap memory. For example, to make a heap-allocated array of ints, you would write: ```cpp dataType* heapArray = new dataType[size]; delete[] heapArray; ``` The `new` keyword is essentially the C++ equivalent of mallocing memory and instantiating a object. When you no longer need memory created with `new`, you use `delete` to destroy it. In this specific case, `delete[]` is used to delete dynamically allocated arrays. **Vectors:** The `std::vector` data type from the standard C++ library stores elements of similar data types in an array-like format. However, unlike the array, the `vector` data type can grow dynamically, similar to an `ArrayList` in Java. You can declare an empty vector variable and add elements as shown: ```cpp std::vector<int> heap_vec; heap_vec.push_back(1); heap_vec.push_back(2); heap_vec.push_back(3); ``` </details> <br /> **As you read the rest of the handout, please look for other C++ tips and notes on syntax.** There are a few syntax-related gotchas related to C++ data structures and creating threads that can lead to some gnarly compilation errors, so we've tried to provide guidance for how to interpret them. ## Assignment Specifications In this lab, you will again implement a parallel reverse index search, but this time you will do so by taking advantage of multiple threads rather than multiple processes. To do this, you will work with the same basic reverse index program as in Part 2 of Lab 5, which takes in a search term and produces a list of all occurrences of that term in a corpus of documents: ``` $ make [ . . . ] $ ./revindex_sequential books peace # Search for "peace" in all files in books directory Found 305 instances of peace. peace found in books/A Christmas Carol in Prose; Being a Ghost Story of Christmas by Charles Dickens.txt at locations: Index 5608: rest, no peace. Incessant torture peace found in books/The Adventures of Tom Sawyer by Mark Twain.txt at locations: Index 5743: was at peace, now that Index 7544: held his peace, for there Index 11064: of the peace; the widow Index 26252: soul at peace again; for Index 29805: repose and peace in the Index 36050: grumblings, and peace resumed her Index 36497: first making peace, and this Index 36509: pipe of peace. There was Index 45431: of the peace, who was Index 57224: of the peace that jugged Index 62303: somewhat of peace and healing peace found in books/Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley.txt at locations: Index 5505: repose in peace. I understand Index 18298: now at peace for ever. [ . . .] ``` In this lab, your goal is to use multiple threads to implement a version of the same program (`./revindex_threads`, defined in `revindex_threads.cpp`). #### Stencil organization We give you with the following files: * `wordindex.cpp`: This file contains all shared functions and data structures. This includes the `wordindex` class for storing the results of a search, all functions involving reading files and directories, and functions to print out the results. You should not modify anything in this file, but you will definitely want to look at the functions and data structures it contains since you will have to use them. * `revindex_sequential.cpp`: This file contains a single-threaded, sequential implementation of a reverse index search. You should use this as a baseline for the performance of your reverse index, and you may want to look at the code to get an idea of how the reverse index should work. However, you should not modify anything in this file. * `revindex_threads.cpp`: This file contains a multi-threaded implementation of a reverse index search. This program should take advantage of concurrency by creating multiple threads such that the reverse index of a search term in multiple files can be calculated at once. Each file should be searched by its own thread. **You only need to modify this file to complete the lab.** ::::success :eyes: **Important note on what to expect:** This lab does not require you to use synchronization objects or atomics. If implemented correctly, each thread works on a different file and shares no state with other threads, so there should be no need for any synchronization between worker threads. Keep this in mind as you set up each thread! :::: In addition, we also provide some input files that are helpful for testing. * The directories `books` and `tiny` contain input files to use for testing. * The `expected_output` folder contains the expected output for searching for certain words on both the `books` and `tiny` corpuses, to check your output without waiting for runs of the sequential version. See the [Running and Testing](#Running-and-Testing) section for more details on how to use the test files. ::::success **Task:** Implement the `run_workers` function in `revindex_threads.cpp` to spawn multiple threads to execute `find_word`. More details of the implementation of `run_workers` can be found in the stencil comments. > <details><summary><strong>Hints</strong></summary> * It is important to not start joining the threads until after every thread has been created. If you were to create one thread and then wait to join with it before creating another thread, you will only ever be running one thread at a time, which would not take advantage of concurrent or parallel programming! * Additionally, keep in mind that the wordindex pointer used as an argument to `find_word` must be a pointer to a wordindex contained in the `results` vector. In C++, elements contained in a vector are allocated in the heap. </details> You may find it helpful to view the [Relevant Library Functions](#Relevant-Library-Functions) portion of this handout when implementing your `run_workers` function. :::: ### Running and Testing You can test your implementation using examples like the following. We recommend testing your work by starting small: first use one worker process using the `--workers` argument and use the "tiny" corpus, then go up to the full "books" corpus and use the full 8 worker processes: ``` $ make # Example: search for "world" in the "tiny" corpus (1 worker process) $ ./revindex_threads --workers 1 tiny world $ make # Example: search for "world" in the "tiny" corpus (8 worker processes) $ ./revindex_threads tiny world # Example: search for "peace" in the "books" corpus $ ./revindex_threads books peace ``` **Testing correctness**: To test your output, you can compare the output against a similar run of the sequential program, or compare against any of the output files in the `expected_output` directory. Your output should match the expected output exactly when search for the same word with the same corpus. Here's an example of how you can check your output for differences using the `diff` command: ```shell # Run a search and write to out.txt $ ./revindex_threads tiny world | tee out.txt diff expected_output/tiny/world.txt out.txt # Compare to expected result for this search # No output means that there were no differences! ``` If you check your results with the `diff` command, `diff` should return no output when the files are equal--this is what you want! If you'd like to check the expected results for any other search terms not present in `expected_output`, you can run the sequential version on the same term and corpus and compare the output, like this (here, searching for the word "potato"): ```shell $ ./revindex_sequential books potato > potato-expected.txt # Run expected version $ ./revindex_threads books potato > potato-actual.txt # Run your version $ diff potato-expected.txt potato-actual.txt # Should show no differences ``` *Fun fact: The `expected_output` directory is just a cache of some test runs from the sequential version that we selected--so you don't need to run the sequential program a bunch of times and wait for the output!* **Testing runtime**: Additionally, once you have a working reverse index using multiple threads, you should find a performance improvement over the sequential implementation. To verify this, you can run the `time` command on both implementations as follows: ```console $ time ./revindex_threads books world $ time ./revindex_sequential books world # [the `real` time should be slower for sequential than threads] ``` You can also compare the performance of your multithreaded reverse index search with the multi-process implementation from Lab 5. Which one do you expect to be faster? ## Relevant Library Functions Below you will find a list of functions you may find helpful in completing this lab. Feel free to also reference the man pages and C++ references for more details. ::::info <details> <summary><b>std::thread</b></summary> * `std::thread(func, args...) ` * The `std::thread` constructor spawns a new thread that beings running the function, `func`, with the specificed arguments * Arguments: * A function to be executed by the new thread. * Any number of arguments as needed by the specified function * Return value: a thread handle that you can use to join the thread (see `join()`). * Below is an example of standard `std::thread` usage: ```c void some_func(int x, int y){ … } void create_threads(){ // create and run the thread std::thread t = thread(some_func, 1, 2); // wait for it to complete t.join(); } ``` </details> :::: ::::info <details> <summary><b>std::thread::join</b></summary> * `join()` * The `std::thread` `join` function waits for the execution of a thread to complete. Calling `t.join()` will halt the calling thread until thread `t`, has completed. (`join` will return immediately if the thread has already completed.) </details> :::: ## Debugging Resources/ GDB ### Switching between threads: When your program is stopped in GDB, you can gain information on the current state of your threads running the command `info threads` (you can stop your program in GDB either through a breakpoint or by typing Ctrl-Z). The `info threads` command will provide you with a numbered list of every currently running thread. From there, you can use the command `thread [number]` (where `[number]` is the number listed by the thread) to inspect the current state of that thread and follow its execution. ### Executing commands across multiple threads: In GDB, you can use the command `thread apply all <command>` to apply a given command to all threads. For example, you can use `thread apply all bt` to get a the stack trace of every running thread. --- # Handin instructions Turn in your code by pushing your git repository to `github.com/csci300/cs300-s25-labs-YOURUSERNAME.git`. Then, head to the [grading server](https://cs300.cs.brown.edu/grade/2025). On the "Labs" page, use the **"Lab 6 checkoff"** button to check off your lab.