Lab 5: Processes
Due Thursday, April 18th at 8:00 PM EST
Assignment
Caution: This lab (and particularly Part II) has historically been the most time-consuming lab for students, so we highly recommend starting early and seeking help early!
Assignment Installation
First, ensure that your repository has a handout remote. Type:
$ git remote show handout
If this reports an error, run:
$ git remote add handout https://github.com/csci0300/cs300-s24-labs.git
Then run:
$ git pull
$ git pull handout main
This will merge our Lab 5 stencil code with your previous work. If you have any "conflicts" from Lab 4, resolve them before continuing further. Run git push to save your work back to your personal repository.
C++ Tips
In both parts of this assignment, you will be using some C++ that might differ slightly from what has been previously used in C.
Arrays: To allocate memory for C++ arrays on the heap, you should use the new keyword. To free that memory, you use the delete keyword. This is shown in the next section, which describes the function of the new and delete keywords.
New/Delete: C++ lets you allocate heap memory via the new operator, which takes the name of a class (or struct) to instantiate an object in heap memory. For example, to make a heap-allocated array of ints, you would write:
dataType* heapArray = new dataType[size];   // allocate an array of size elements on the heap
delete[] heapArray;                         // free the array when you are done with it
The new keyword is essentially the C++ equivalent of calling malloc and then constructing an object in that memory. When you no longer need memory created with new, you use delete to destroy it. In this specific case, delete[] is used to delete dynamically allocated arrays.
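To make the distinction concrete, here is a short illustration (Point is just a made-up struct for this example):

struct Point { int x; int y; };

Point* p = new Point{1, 2};           // single object: allocate with new, free with delete
delete p;

int* nums = new int[4]{0, 1, 2, 3};   // array: allocate with new[], free with delete[]
delete[] nums;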
Vectors: The vector data type from the C++ standard library stores elements of the same type in an array-like format. However, unlike an array, a vector can grow dynamically. You can declare an empty vector variable and add elements as shown:
#include <vector>
using std::vector;

vector<int> heap_vec;     // empty vector of ints
heap_vec.push_back(1);    // append elements to the end
heap_vec.push_back(2);
heap_vec.push_back(3);
Assignment Part I: Simple Shell
In the first part of this lab, you will write the process launching code for your own shell, called 300sh. 300sh is similar to the shell you type your commands into (a program called bash) while running your Docker container, but it's a lot simpler.
Introduction
Your shell needs to be able to start new processes via system calls into the OS kernel. For this purpose, you will use the fork and execv system calls discussed in lectures.
What's the deal with execv? Thus far, you have primarily written programs which execute a series of instructions, often organized into functions. However, sometimes you'd like to write programs that have the ability to execute other programs. This can be done on Linux with the exec family of system calls. When one program executes another using execv, the current process image is completely replaced with that of the new program. The process image consists of the virtual address space of the current process, including the stack, the heap, the global variables (.data), and the code (.text) sections. All of these are replaced by the corresponding sections of the new program. If execv is successful, the original program will never be returned to, as that process has been completely overwritten!
Here's an example usage of execv. Try running this program in your Docker container with the file name example.cpp.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    char program_name[10] = "/bin/echo";
    const char* arg1 = "hello";
    const char* arg2 = "world";
    // The argument array passed to the new program; it must end with NULL.
    const char* newargv[] = { program_name, arg1, arg2, NULL };
    // Replace this process's image with /bin/echo; on success, execv never returns.
    execv("/bin/echo", (char* const*)newargv);
    perror("execv failed");
    exit(EXIT_FAILURE);
}
Running this program will produce this output:
$ gcc -o example example.cpp
$ ./example
hello world
Try changing the first argument of execv to an invalid program name (e.g., /bin/hello) and see what happens.
What's the deal with fork? What if you want to execute a new program while still being able to return to the original, rather than overwriting it with the new program? This is where the ability to create multiple processes becomes useful. The fork system call will create a new process (the "child process") whose address space is a copy of its parent process's address space. However, while they map the same virtual addresses, these two processes run in separate address spaces, so changes to one do not affect the other!
Here's an example usage of fork. Try running this program in your Docker container or repl.it with the file name example.cpp.
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("I am in the parent process.\n");
    int x = 5;
    pid_t child;
    if ((child = fork()) == 0) {
        // fork returned 0, so this is the child process.
        x = x + 1;
        printf("I am in the child process: x is %d.\n", x);
    }
    // Both the parent and (currently) the child run the code below.
    printf("I am outside the if statement.\n");
    printf("Child is %d and x is %d.\n", child, x);
}
Running this program will produce an output similar to this:
$ gcc -o example example.cpp
$ ./example
I am in the parent process.
I am outside the if statement.
Child is 1753 and x is 5.
I am in the child process: x is 6.
I am outside the if statement.
Child is 0 and x is 6.
Note: Try running the program in your Docker container. (In online environments like repl.it, you may not see all the output.)
Since the child process is a copy of the parent's address space, it will perform the same instructions as its parent starting from the point where fork returns, and it has a copy of its parent's stack, heap, and data (e.g., it has access to a copy of variable x). fork will return a different value depending on whether it's in the child or the parent process, allowing us to execute different instructions in the child and parent processes (e.g., the parent process can't execute instructions within the if statement, but the child can). Currently, the child can also execute instructions after the if statement. Let's change that!
Optional Task 1: Inside the if block, immediately after the child's printf, add the line exit(0); to exit from the child process with return code 0. Also, include #include <stdlib.h> as a header.
Optional Task 2: Immediately after the closing brace of the if block, declare a local integer called status and then add the line waitpid(child, &status, 0); to tell the parent to wait for the child to terminate before it continues executing instructions. Also, include #include <sys/wait.h> as a header.
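For reference, here is a sketch of what the example looks like with both optional tasks applied (your exact placement may differ slightly):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    printf("I am in the parent process.\n");
    int x = 5;
    pid_t child;
    if ((child = fork()) == 0) {
        x = x + 1;
        printf("I am in the child process: x is %d.\n", x);
        exit(0);                      // Optional Task 1: the child exits here
    }
    int status;
    waitpid(child, &status, 0);       // Optional Task 2: the parent waits for the child
    printf("I am outside the if statement.\n");
    printf("Child is %d and x is %d.\n", child, x);
}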
Assignment Specifications
For this part of the lab, you will be completing the provided implementation of a simple shell by adding functionality for executing programs. You are provided with the file sh.cpp and its corresponding Makefile. Currently, sh.cpp contains a read-evaluate-print loop (REPL) which reads user input and parses it into individual tokens. If you run ./300sh after make, you'll see that the shell currently doesn't do anything, except for the builtin commands exit and cd.
$ ./300sh
300sh> ls
300sh> yo
300sh> hello
300sh> /bin/ls
300sh> cd /bin
Current directory is now /bin
300sh> exit
Your task is to extend this functionality to be able to execute any arbitrary program (or fail if the program does not exist). This should all be done within the do_fork function in the stencil.

When a process, in this case the shell, executes another program, the current process is completely replaced by the program it executes. In order for your shell to continue execution after running another program, it needs to create a new child process in which the new program can run, so as not to replace the shell's execution. As such, if you create a new process (using fork) in which to execute a different program (using execv), the original program will remain preserved in the parent process, and its execution can continue rather than be replaced.
Task:
Implement the do_fork(char *args[]) function in sh.cpp. This function should:
- Fork into a new child process.
- The child process should execute the desired command (the first element in args), with the desired arguments (the following elements in args).
- The parent process should wait for the child process to terminate before executing the next command.

The argument of do_fork, args, is such that:
- the first argument is the full path to the command to be executed
- the following elements are any arguments to that command
- the last argument is NULL or nullptr (meaning that args is null-terminated)
For example, if args were the array {"/bin/echo", "hello", NULL}, do_fork would create a child process to execute the program /bin/echo with the argument hello.
Understanding the arguments of execv: The execv system call takes as arguments:
- the full path to the desired command
- an array of strings in which the first element is the program name (this could just be the full path), the following elements are arguments to the command, and the last element is NULL

(Make sure to handle the case where execv fails!)

Check out the example usages of fork and execv above for more help.
When implementing this part of the lab you may find the fork, execv, and wait or waitpid system calls useful. Please read the Relevant System Calls section for more details.
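Putting these pieces together, here is a minimal sketch of the overall fork/execv/waitpid pattern, assuming a void return type (the stencil's do_fork may need to return a value or report errors in a particular way; follow the stencil's comments):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Sketch only: the real do_fork lives in sh.cpp, and its return type and
// error handling should follow the stencil.
void do_fork(char *args[]) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return;
    }
    if (pid == 0) {
        // Child: args[0] is the full path to the program; args is NULL-terminated.
        execv(args[0], args);
        // execv only returns on failure.
        perror("execv");
        exit(1);
    }
    // Parent: wait for the child to finish before accepting the next command.
    int status;
    waitpid(pid, &status, 0);
}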
Running and Testing
Compile and Run:
- Compile with make
- Run with ./300sh
When your shell is working, it should exhibit the same basic functionality as a regular bash terminal when running processes in the foreground. Just note that when executing programs in your shell, you will need to use the full pathname. For example, instead of typing ls as you usually would, you will have to use /bin/ls. (This is the case because your shell does not automatically search standard directories like /bin for binaries the way bash does via $PATH.)
If you do not know the full pathname of a program you wish to execute in your shell, you can use the which <program> command in a standard bash terminal. For example, try the following:
$ which cat
/bin/cat
$ ./300sh
300sh> /bin/ls
300sh Makefile builtin.h builtin.o sh.cpp sh.o
300sh> /bin/cat builtin.h
... prints the contents of builtin.h ...
To get a sense of whether or not your shell is working properly, you can compare the output of your shell to a standard bash terminal for some simple commands such as the following:
- cat <file> (or /bin/cat <file>)
- echo <args> (or /bin/echo <args>)
- cal (or /usr/bin/cal)
- ls (or /bin/ls)
Note that 300sh supports passing arguments to programs, but does not support output redirection (e.g., /bin/ls > /tmp/outfile).
Now head to the grading server and run the checkoff for Lab 5.
You should see the "shell" test pass. Don't worry about others failing or partial credit; if you pass the "shell" test, you've done what you need to do for Part I of the lab.
Assignment Part II: Reverse Index
Introduction
One common paradigm of programming, perhaps the one with which you are most familiar, is sequential programming. Sequential programs execute instructions one at a time in a specific order. With the aid of multiple processes, however, we can also have concurrent programs. Concurrent programs execute in parallel, either on multiple processors in your computer or in a time-sliced way on a single processor.
You have experience writing concurrent programs using fork already! fork creates a new process in which a distinct task can be executed in parallel with existing processes.
One benefit to concurrent programming is that it can improve program performance by splitting one big task into smaller subtasks which can be executed at the same time rather than sequentially (assuming you're running on a multiprocessor computer). However, since multiple processes exist in distinct virtual address spaces, the data in one process's address space is not visible to any other process. This becomes a problem when using multiple processes to speed up a task, as you will eventually want to pool all child process results together.
In this part of the lab, you will learn how to share data between processes via pipes. A pipe is a one-way channel through which data can be sent from one process and received by another. On Linux, it is represented as an array of two file descriptors, with the first (element 0) representing the file descriptor for the read end of the pipe, and the second (element 1) representing the file descriptor for the write end of the pipe. Data can be read from and written to pipes using the standard read and write system calls.
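As a quick standalone illustration (not part of the lab code), the snippet below creates a pipe and pushes a few bytes through it within a single process:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main() {
    int fd[2];
    if (pipe(fd) == -1) {             // fd[0] is the read end, fd[1] is the write end
        perror("pipe");
        return 1;
    }
    const char* msg = "hi";
    write(fd[1], msg, strlen(msg));   // bytes go in through the write end...
    char buf[3] = {0};
    read(fd[0], buf, 2);              // ...and come back out of the read end
    printf("%s\n", buf);
    close(fd[0]);
    close(fd[1]);
}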
Assignment Specifications
For this lab, you will use multiple processes to improve the performance of a reverse index program, which takes in a word and produces a list of the occurrences of the word in a corpus of documents.
This is the essence of how a search engine like Google works! When you type a keyword into Google, the Google servers look up your keyword in a giant reverse index, finding the web pages that mention this keyword. It then ranks the results according to a highly secret algorithm.
Google's reverse index is split over many servers, and speeds up lookups using concurrency. For this assignment, you will build a more modest, single-machine reverse index, but concurrency will still help speed it up!
In the multiProcs directory of the Lab 5 stencil, we give you the following files:
- wordindex.h: This file contains all shared functions and data structures. This includes the wordindex class for storing the results of a search, all functions involving reading files and directories, and functions to print out the results. You should not modify anything in this file, but you will definitely want to look at the functions and data structures it contains, since you will have to use them.
- revindex_sequential.cpp: This file contains a single-threaded, sequential implementation of a reverse index search. You should use this as a baseline for the performance of your reverse index, and you may want to look at the code to get an idea of how the reverse index should work. However, you should not modify anything in this file.
- revindex_procs.cpp: This file contains a multiple-process implementation of a reverse index search. This program should take advantage of parallel execution by creating multiple processes, such that the reverse index of a term can be calculated for multiple files at once. Each file should be searched by its own process. You only need to modify this file to complete the lab.
In addition, we also give you some input files that are helpful for testing:
- The books directory, containing a series of .txt files of classic novels
- input.txt: This file contains a series of search terms separated by newlines. It can be used for testing purposes to execute multiple searches by running ./revindex_parallel books < input.txt
- WORD.txt: All files called WORD.txt (besides input.txt) contain the expected results of a reverse index of the term WORD in the books directory. These files are meant for testing/debugging purposes.
Try building with make revindex_sequential, then run ./revindex_sequential books and type in a few terms to get a feel for how revindex_sequential works.
Task:
Implement the following functions in revindex_procs.cpp. Below is a brief overview of what each function is responsible for. The comments in the stencil provide a more detailed explanation of each function.

process_input: This function is responsible for setting up the pipes for each child process, creating the child processes, pooling the data from each child process, and waiting for them to complete.

process_file: This function should be run by the child processes to search for occurrences of a given term in the child process's delegated file (see `wordindex.h` for functions to use for this). This function is also responsible for writing the search results for the file a child process searched back to the parent process through a pipe.

read_process_results: This function should be run by the parent process to read the results from a child process through a pipe.

When implementing this part of the lab, you may find the details in the pipe section of this handout useful.
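To make the data flow concrete, here is a standalone sketch (not the stencil code) of the pattern a single worker might use: the child writes a length prefix followed by a serialized payload, and the parent reads the length first so it knows exactly how many bytes follow. The "fake serialized results" string stands in for whatever the provided serialization helpers actually produce.

#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

int main() {
    int fds[2];
    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        // Child: pretend this is the serialized search result for its file.
        close(fds[0]);                                   // the child only writes
        std::string bytes = "fake serialized results";   // placeholder for the real serializer
        int len = (int) bytes.size();
        write(fds[1], &len, sizeof(int));                // length prefix first...
        write(fds[1], bytes.data(), len);                // ...then the payload
        close(fds[1]);
        exit(0);
    }

    // Parent: read the length prefix, then exactly that many bytes.
    close(fds[1]);                                       // the parent only reads
    int len = 0;
    read(fds[0], &len, sizeof(int));
    std::vector<char> buf(len);
    read(fds[0], buf.data(), len);
    close(fds[0]);
    waitpid(pid, NULL, 0);                               // reap the worker

    printf("parent received %d bytes: %.*s\n", len, len, buf.data());
    return 0;
}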
Hints
- If you're getting a timeout error, it's not because multiprocessing is slow! Think about where in your code reading from a pipe could block. How might the amount of data written affect this? Consider the case of trying to write a huge amount of data.
- Be careful of how you manage your processes. You'll want to know that all of your processes execute successfully, but keep in mind the purpose of this assignment: concurrency!
- Each process you fork will be responsible for processing one file, but each iteration of the loop in `process_input` starts several (up to workers) processes. You will need to ensure that you're passing the right filename from the filenames vector to each process.
- If you are confused about how the pipes between the parent and child processes work, it may help to sketch a picture of which process runs what code and what data flows from one process to another.
- Recall that different processes live in different address spaces. When a worker process sends back results through a pipe, it cannot send the wordindex data structure directly (why not? Think about where the memory backing the vectors it contains is allocated). You will need to serialize the data before writing it to the pipe; check out the helper functions we provide!
- The serialize/deserialize helpers we provide only serialize (encode) some members of the wordindex object. You will need to appropriately set the others after deserializing!
- Since the parent process does not know how many results a child worker process found, you'll face the question of how many bytes to read from the pipe using the read syscall. Check out the examples we provide with the explanation of pipe for some ideas on how to determine this quantity.
- If you're debugging an issue, it may help to add logging to your code that prints the process ID (available via the getpid() system call) and what that process is trying to read or write to/from a pipe.
Running and Testing:
Compile:
- The whole project with make all
- Just the provided sequential implementation with make revindex_sequential
- Just the parallel implementation with make revindex_parallel

Run:
The books directory is provided for you, but feel free to create your own directories containing fewer or shorter files for preliminary testing purposes.
- The sequential implementation with ./revindex_sequential <directory>
- The parallel implementation with ./revindex_parallel <directory>

Test:
We have provided you with the expected output for some search terms over the books directory (all files titled WORD.txt), so that you can compare your results. You can also compare your implementation with the sequential implementation via make check. The results should be the same.
$ make check
./revindex_parallel books < input.txt > output_parallel.txt
./revindex_sequential books < input.txt > output_sequential.txt
diff output_parallel.txt output_sequential.txt
... You should receive no output/differences ...
- Runtime: Additionally, once you have a working reverse index using parallel processes, you should find a performance improvement over the sequential implementation. To verify this, run:
$ time ./revindex_parallel books < input.txt
$ time ./revindex_sequential books < input.txt
... the sequential implementation should be slower than the parallel ...
Relevant System Calls
Below you will find a list of functions that you may find useful for this lab. Feel free to also view the man pages if you would like more details:
Fork:
fork()
- The fork system call creates a new child process whose address space is a copy of the calling (parent) process's address space
- fork returns a pid_t (an integer value): in the parent, it returns the child's process ID; in the child, it returns 0; on failure, it returns -1
- The child process runs the same code as its parent until execv is called. Among other things, this means that the child process has its own copy of all variables set by the parent prior to the call to fork()!
Execv:
execv(const char* path, const char* args[])
- The execv system call is used to execute a program. In doing so, it completely replaces the image of the current process with that of the new program
- Note: this means that, when successful, execv does not return, since the program that called it has been replaced and is no longer running
- Arguments:
  - the full path to the program to be executed
  - an array of strings in which the first element is the program name and the following elements are the program's arguments, terminated by NULL
- An example of execv’s usage is as follows:
const char* path = "/bin/cat";
const char* args[] = { "cat", "file.txt", NULL };
execv(path, (char* const*) args);
perror("execv");   // execv only returns on failure
exit(1);
Wait and Waitpid:
It is often necessary for a process to wait for a child process to terminate before continuing execution. To accomplish this, you can use either wait or waitpid.

wait(int* status)
- The wait system call returns once any child of the calling process terminates
- wait has one argument: a pointer to an integer, which will contain the status of the terminated child once wait returns (this argument can be NULL if you do not care about the child's status)
- wait returns the process ID of the terminated child process on success and -1 on failure

waitpid(pid_t pid, int* status, int options)
- The waitpid system call returns once the child process specified by the pid argument has changed state, as dictated by the options argument
- By default (when options is 0), waitpid returns when the child process terminates, but other options can be specified
- The status argument will hold the exit status of the terminated child; it can be NULL if this value is not needed
- waitpid returns the process ID of the terminated child process on success and -1 on failure
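If you also want to inspect how a child exited, here is a small standalone sketch (not stencil code) that uses waitpid together with the WIFEXITED and WEXITSTATUS macros from <sys/wait.h>:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t child = fork();
    if (child == 0) {
        exit(7);                               // child terminates with exit code 7
    }
    int status;
    pid_t reaped = waitpid(child, &status, 0); // block until this child terminates
    if (reaped == -1) {
        perror("waitpid");
        exit(1);
    }
    if (WIFEXITED(status)) {                   // did the child exit normally?
        printf("child %d exited with code %d\n", reaped, WEXITSTATUS(status));
    }
}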
Pipe:
pipe(int fd[2])
- The pipe system call creates a one way channel through which processes can share data.
- Upon return, the fd argument will contain 2 file descriptors (integers).
- fd[0] refers to the side of the pipe from which data can be read.
- fd[1] refers to the side of the pipe to which data can be written.
- Data that is written to fd[1] is buffered by the kernel until it is read from fd[0]
- Below are two examples of using the pipe() system call:
int fd[2];
if (pipe(fd) == -1) {
    perror("pipe");
    exit(1);
}
if (!fork()) {
    // Child: write the message into the pipe, then close both ends and exit.
    const char* message = "Hello World!";
    write(fd[1], message, strlen(message));
    close(fd[0]);
    close(fd[1]);
    exit(0);
} else {
    // Parent: close its unused write end so read() can eventually see EOF,
    // then read one byte at a time until the pipe is empty and closed.
    close(fd[1]);
    char buf[256];
    int loc = 0;
    while (read(fd[0], &buf[loc], 1) > 0) {
        loc++;
    }
    buf[loc] = '\0';            // NUL-terminate before printing
    fprintf(stdout, "%s\n", buf);
    wait(NULL);
    close(fd[0]);
}
int fd[2];
if (pipe(fd) == -1) {
    perror("pipe");
    exit(1);
}
if (!fork()) {
    // Child: first write the length of the message, then the message itself.
    const char* message = "Hello World!";
    int len = strlen(message);
    write(fd[1], &len, sizeof(int));
    write(fd[1], message, len);
    close(fd[0]);
    close(fd[1]);
    exit(0);
} else {
    // Parent: read the length first, so it knows exactly how many bytes follow.
    close(fd[1]);
    int size;
    read(fd[0], &size, sizeof(int));
    char* buf = new char[size + 1];
    read(fd[0], buf, size);
    buf[size] = '\0';           // NUL-terminate before printing
    fprintf(stdout, "%s\n", buf);
    wait(NULL);
    close(fd[0]);
    delete[] buf;
}
Note: The second example first writes the size of the data and then the data itself. Since you know the order in which data is written to the pipe, you know that if you first read sizeof(int) bytes, you will get the length of the remaining portion of the data (the message). You then know how much to read so that you can read the entirety of the message in just one more call to read.

In the first example, data is read from the pipe one byte at a time until read returns 0 (there is no more data left to read). In this example, passing the data through the pipe takes 14 system calls: 1 write to the pipe, 12 reads to get the whole message, and 1 more read to hit EOF. In contrast, passing the data through the pipe in the second example uses only 4 system calls: 2 writes and 2 reads. System calls are costly since they involve handing over control to the OS. As such, you will generally find your programs to be more efficient if you cut down on system calls.
Debugging Resources/ GDB
Debugging Multi-process Programs:
If you use GDB to help you debug this lab (as is always recommended!), it is important to note that the default behavior of GDB is to not follow the execution of child processes after they are started with fork. As such, if you set a breakpoint on a line that executes within a child process, GDB won’t break on that line.
However, you can change this behavior by running the GDB command set follow-fork-mode child. This tells GDB to follow the execution of the child process after a fork, and to leave the parent process to run freely (but not followed). After setting this, GDB will hit breakpoints you set within a forked process, but not any breakpoints set in the parent process beyond the call to fork. If needed, run help set follow-fork-mode within GDB for more details.
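For example, a GDB session for Part II might start out like this (the breakpoint location is just an illustration):

$ gdb ./revindex_parallel
(gdb) set follow-fork-mode child
(gdb) break process_file
(gdb) run books < input.txt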
Handin instructions
Turn in your code by pushing your git repository to github.com/csci0300/cs300-s24-labs-YOURUSERNAME.git.
Then, head to the grading server. On the "Labs" page, use the "Lab 5 checkoff" button to check off your lab.