« Back to the main CS 300 website
Lab 6: Threads
Due Tuesday, April 23rd at 8:00 PM EST
Introduction
In Lab 5, you implemented a concurrent program using multiple processes. With multiple processes, each process gets its own distinct address space. By default, no data is shared, and in Lab 5, you dealt with this by using pipes to send data between processes. (Another approach is to designate a shared region of memory with the mmap
system call.) However, sometimes you want concurrency within a process, with the different parallel executions truly sharing memory.
In this lab, you will gain experience creating concurrent programs using multiple threads. Using multiple threads rather than processes is beneficial as threads of the same process share an address space. The only private region of memory for threads is their stack, so any data contained in the heap or global sections are shared across all threads of the same processes, and their changes are visible to each other.
It is important to note the difference between a process and a thread. A process is the general container responsible for the execution of a program. Threads are the individual entities within a process which actually do the execution. A single process can contain multiple threads all working in its address space. Every process contains at least one thread, the "main" thread, whose execution starts in main
!
Assignment
Assignment installation
First, ensure that your repository has a handout
remote. Type:
$ git remote show handout
If this reports an error, run:
$ git remote add handout
https://github.com/csci0300/cs300-s24-labs.git
Then run:
$ git pull
$ git pull handout main
This will merge our Lab 6 stencil code with your previous work. If you have any "conflicts" from Lab 5, resolve them before continuing further. Run git push
to save your work back to your personal repository.
Assignment Specifications
In this lab, you will again implement a parallel reverse index search; however, this time you will do so by taking advantage of multiple threads rather than multiple processes. You will work with the same basic reverse index program as in Lab 6. The program takes in a search term and produces a list of all occurrences of that term in a corpus of documents.
We provide you with the following files:
wordindex.h
: This file contains all shared functions and data structures. This includes the wordindex
class for storing the results of a search, all functions involving reading files and directories, and functions to print out the results. You should not modify anything in this file, but you will definitely want to look at the functions and data structures it contains since you will have to use them.
revindex_sequential.cpp
: This file contains a single-threaded, sequential implementation of a reverse index search. You should use this as a baseline for the performance of your reverse index, and you may want to look at the code to get an idea of how the reverse index should work. However, you should not modify anything in this file.
revindex_threads.cpp
: This file contains a multi-threaded implementation of a reverse index search. This program should take advantage of concurrency by creating multiple threads such that the reverse index of a search term in multiple files can be calculated at once. Each file should be searched by its own thread. You only need to modify this file to complete the lab.
Note: This lab does not require you to use synchronization objects or atomics.
Why?
If you implement the lab correctly, each thread works on a different file and shares no state with other threads.
We also give you some input files that are helpful for testing.
- The books directory containing a series of .txt files of classic novels
input.txt
: This file contains a series of search terms separated by a newline. It can be used for testing purposes to execute multiple searches by running ./revindex_threads books < input.txt
WORD.txt
: all files called WORD.txt (besides input.txt
) contain the expected results of a reverse index of the term WORD
in the books directory. These files are meant for testing/debugging purposes.
Task:
Implement the run_workers
function in revindex_threads.cpp
to spawn multiple threads to execute find_word
. More details of the implementation of run_workers
can be found in the stencil comments.
Hints
- It is important to not start the joining threads until after every thread has been created. If you were to create one thread and then wait to join with it before creating another thread, you will only ever be running one thread at a time. This is not actually concurrent programming!
- Additionally, it is important that the wordindex pointer used as an argument to
find_word
is a pointer to a wordindex contained in the files vector. In C++, elements contained in a vector are allocated in the heap. It is necessary that your wordindex objects are contained in the heap so that they are visible across threads.
You may find it helpful to view the Relevant Library Functions portion of this handout when implementing your run_workers
function.
Running and Testing:
Compile:
- The whole project with
make all
- Just the provided sequential implementation with
make revindex_sequential
- Just the parallel implementation with
make revindex_threads
Run:
The books directory is provided for you, but feel free to create your own directories containing fewer or shorter files for preliminary testing purposes.
- The sequential implementation with
./revindex_sequential <directory>
- The parallel implementation with
./revindex_threads <directory>
Test:
We have provided you with the expected output for some search terms over the books directory (all files titled WORD.txt
), so that you can compare your results. You can also compare your implementation with the sequential implementation via make check
. The results should be the same.
$ make check
./revindex_threads books < input.txt > output_threads.txt
./revindex_sequential books < input.txt > output_sequential.txt
diff output_threads.txt output_sequential.txt
... You should receive no output/differences ...
- Runtime: Additionally, once you have a working reverse index using parallel processes, you should find a performance improvement over the sequential implementation. To verify this, run:
$ time ./revindex_threads books < input.txt
$ time ./revindex_sequential books < input.txt
... the sequential implementation should be slower than the parallel ...
You can also compare the performance of your multithreaded reverse index search with the multi-process implementation from Lab 6. Which one do you expect to be faster?
Relevant Library Functions
Below you will find a list of functions you may find helpful in completing this lab. Feel free to also reference the man pages and C++ references for more details.
std::thread
std::thread(func, args...)
- The
std::thread
constructor spawns a new thread that beings running the function, func
, with the specificed arguments
- Arguments:
- A function to be executed by the new thread.
- Any number of arguments as needed by the specified function
- Return value: a thread handle that you can use to join the thread (see
join()
).
- Below is an example of standard
std::thread
usage:void some_func(int x, int y){
…
}
void create_threads(){
std::thread t = thread(some_func, 1, 2);
t.join();
}
std::thread::join
join()
- The
std::thread
join
function waits for the execution of a thread to complete. Calling t.join()
will halt the calling thread until thread t
, has completed. (join
will return immediately if the thread has already completed.)
Debugging Resources/ GDB
Switching between threads:
When your program is stopped in GDB, you can gain information on the current state of your threads running the command info threads
(you can stop your program in GDB either through a breakpoint or by typing Ctrl-Z). The info threads
command will provide you with a numbered list of every currently running thread. From there, you can use the command thread [number]
(where [number]
is the number listed by the thread) to inspect the current state of that thread and follow its execution.
Executing commands across multiple threads:
In GDB, you can use the command thread apply all <command>
to apply a given command to all threads. For example, you can use thread apply all bt
to get a the stack trace of every running thread.
Handin instructions
Turn in your code by pushing your git repository to github.com/csci300/cs300-s23-labs-YOURUSERNAME.git
.
Then, head to the grading server. On the "Labs" page, use the "Lab 6 checkoff" button to check off your lab.
« Back to the main CS 300 website
Lab 6: Threads
Due Tuesday, April 23rd at 8:00 PM EST
Introduction
In Lab 5, you implemented a concurrent program using multiple processes. With multiple processes, each process gets its own distinct address space. By default, no data is shared, and in Lab 5, you dealt with this by using pipes to send data between processes. (Another approach is to designate a shared region of memory with the
mmap
system call.) However, sometimes you want concurrency within a process, with the different parallel executions truly sharing memory.In this lab, you will gain experience creating concurrent programs using multiple threads. Using multiple threads rather than processes is beneficial as threads of the same process share an address space. The only private region of memory for threads is their stack, so any data contained in the heap or global sections are shared across all threads of the same processes, and their changes are visible to each other.
It is important to note the difference between a process and a thread. A process is the general container responsible for the execution of a program. Threads are the individual entities within a process which actually do the execution. A single process can contain multiple threads all working in its address space. Every process contains at least one thread, the "main" thread, whose execution starts in
main
!Assignment
Assignment installation
First, ensure that your repository has a
handout
remote. Type:If this reports an error, run:
Then run:
This will merge our Lab 6 stencil code with your previous work. If you have any "conflicts" from Lab 5, resolve them before continuing further. Run
git push
to save your work back to your personal repository.Assignment Specifications
In this lab, you will again implement a parallel reverse index search; however, this time you will do so by taking advantage of multiple threads rather than multiple processes. You will work with the same basic reverse index program as in Lab 6. The program takes in a search term and produces a list of all occurrences of that term in a corpus of documents.
We provide you with the following files:
wordindex.h
: This file contains all shared functions and data structures. This includes thewordindex
class for storing the results of a search, all functions involving reading files and directories, and functions to print out the results. You should not modify anything in this file, but you will definitely want to look at the functions and data structures it contains since you will have to use them.revindex_sequential.cpp
: This file contains a single-threaded, sequential implementation of a reverse index search. You should use this as a baseline for the performance of your reverse index, and you may want to look at the code to get an idea of how the reverse index should work. However, you should not modify anything in this file.revindex_threads.cpp
: This file contains a multi-threaded implementation of a reverse index search. This program should take advantage of concurrency by creating multiple threads such that the reverse index of a search term in multiple files can be calculated at once. Each file should be searched by its own thread. You only need to modify this file to complete the lab.Note: This lab does not require you to use synchronization objects or atomics.
Why?
If you implement the lab correctly, each thread works on a different file and shares no state with other threads.
We also give you some input files that are helpful for testing.
input.txt
: This file contains a series of search terms separated by a newline. It can be used for testing purposes to execute multiple searches by running./revindex_threads books < input.txt
WORD.txt
: all files called WORD.txt (besidesinput.txt
) contain the expected results of a reverse index of the termWORD
in the books directory. These files are meant for testing/debugging purposes.Task:
Implement the
run_workers
function inrevindex_threads.cpp
to spawn multiple threads to executefind_word
. More details of the implementation ofrun_workers
can be found in the stencil comments.Hints
find_word
is a pointer to a wordindex contained in the files vector. In C++, elements contained in a vector are allocated in the heap. It is necessary that your wordindex objects are contained in the heap so that they are visible across threads.You may find it helpful to view the Relevant Library Functions portion of this handout when implementing your
run_workers
function.Running and Testing:
Compile:
make all
make revindex_sequential
make revindex_threads
Run:
./revindex_sequential <directory>
./revindex_threads <directory>
Test:
$ make check ./revindex_threads books < input.txt > output_threads.txt ./revindex_sequential books < input.txt > output_sequential.txt diff output_threads.txt output_sequential.txt ... You should receive no output/differences ...
$ time ./revindex_threads books < input.txt $ time ./revindex_sequential books < input.txt ... the sequential implementation should be slower than the parallel ...
You can also compare the performance of your multithreaded reverse index search with the multi-process implementation from Lab 6. Which one do you expect to be faster?
Relevant Library Functions
Below you will find a list of functions you may find helpful in completing this lab. Feel free to also reference the man pages and C++ references for more details.
std::thread
std::thread(func, args...)
std::thread
constructor spawns a new thread that beings running the function,func
, with the specificed argumentsjoin()
).std::thread
usage:std::thread::join
join()
std::thread
join
function waits for the execution of a thread to complete. Callingt.join()
will halt the calling thread until threadt
, has completed. (join
will return immediately if the thread has already completed.)Debugging Resources/ GDB
Switching between threads:
When your program is stopped in GDB, you can gain information on the current state of your threads running the command
info threads
(you can stop your program in GDB either through a breakpoint or by typing Ctrl-Z). Theinfo threads
command will provide you with a numbered list of every currently running thread. From there, you can use the commandthread [number]
(where[number]
is the number listed by the thread) to inspect the current state of that thread and follow its execution.Executing commands across multiple threads:
In GDB, you can use the command
thread apply all <command>
to apply a given command to all threads. For example, you can usethread apply all bt
to get a the stack trace of every running thread.Handin instructions
Turn in your code by pushing your git repository to
github.com/csci300/cs300-s23-labs-YOURUSERNAME.git
.Then, head to the grading server. On the "Labs" page, use the "Lab 6 checkoff" button to check off your lab.