Project 5A: Concurrent Store 🗂🗃🗄

Due: Friday, April 26 at 6:00 PM EST

Introduction

A key-value store is a storage system in which data (a value) is addressed by a key; common examples are Python's dictionaries or Java/C++ hash maps. Key-value stores are popular for large-scale web applications because they are fast and scalable.

Clients interface with the key-value store through a server. Once client(s) connect to the server, they can issue requests to get, put, append, and delete key-value pairs from the store.

In this project, you'll implement two kinds of key-value stores. You'll then extend your implementation to support concurrent client requests.

Production applications often handle so much data (and so many client connections) that one key-value store is not enough – they need to divide (i.e., shard) their data between multiple stores. In Project 5B, you'll extend your implementation to support such a distributed system.

Learning Objectives

This assignment will help you:

Familiarize yourself with synchronization primitives, namely mutexes.
Think about problems introduced by concurrency, like deadlocks and race conditions on shared data structures.
Learn about key networking abstractions, the "client" and "server," and the roles of each.
Understand the implications of privacy laws for storage systems, and how developers need to implement technical functionality to give users rights over their data.
Reason about trade-offs between stakeholder interests and public goods relating to privacy and data protection.

Assignment Installation

Ensure that your project repository has a handout remote. Type:

$ git remote show handout

If this reports an error, run:

$ git remote add handout https://github.com/csci0300/cs300-s24-projects.git

Then run:

$ git pull
$ git pull handout main

This will merge our Project 5 stencil code with your repository.

Once you have a local working copy of the repository that is up to date with our stencils, you are good to proceed.

Infrastructure Help

Stencil and Initial State

Project 5B also uses this stencil, so there is more code in the repository than you will need for this assignment. These are the directories you'll need for this assigment:

Directories	Contents
`build/`	See "Compile and Run".
`tests/` and `test_utils/`	See "Testing".
`client/`	The `simple_client` implementation you'll run in this assignment, along with the commands it supports. Also has the stencil for `shardkv_client`, which you'll implement in Project 5B.
`gdpr/`	The `database.csv` file you'll work with in the Tweeter portion of the assignment
`kvstore/`	The definition of the abstract `KvStore` interface, as well as the `SimpleKvStore` and `ConcurrentKvStore` classes.
`server/`	The definition of the `KvServer` class and support code.

If you're curious, descriptions of the rest of the directories are below:

Directories	Contents
`cmd/`	The runner code that starts the client, server, and shardcontroller. You should not modify these.
`net/`	The networking support code used by the clients and servers. You should not modify these.
`repl/`	Utilities for running the client, server, and shardcontroller on the command line. You should not modify these.
`common/`	The `Shard` class and some helpers for managing or operating on them. Also contains miscellaneous utility functions, such as color printing. These will both be useful in Project 5B.
`shardcontroller/`	The definition of the `shardcontroller` class, and support code to process client requests. You'll implement the shardcontroller methods in Project 5B.

Compile and Run

To compile, cd into the build/ directory and run make. This will create two executables: simple_client andserver.

To run the server, specify the port, and optionally the number of workers:

$ ./server <port> [n_workers]

To run the simple client, specify the server address:

$ ./simple_client <server hostname:port>

Example Usage

To connect multiple clients to a server, start by opening multiple terminals (or use tmux).

Terminal 1: runs server on port 5000 with the default number of workers:

$ ./server 5000

Your server will then be listening on <hostname>:5000; the server's full address will be displayed in the terminal.

Terminal 2: runs simple_client to connect to the server, using the address displayed above (in this example, the hostname is 65a4af2c28ae):

$ ./simple_client 65a4af2c28ae:5000

The server defaults to using your SimpleKvStore. To change that, edit the line this->store = std::make_unique<SimpleKvStore>(); in server.cpp to use a ConcurrentKvStore.

Using the Client

Once connected to the server, you can use one or more clients to run standard key-value store operations, like get, put, append, and delete. You can also use two additional requests, multiget and multiput (for multiple key-value pairs). For specific usage instructions, type help in the command prompt.

To exit the client (and any other executable), press Ctrl-D (or Cmd-D); this sends End-Of-File (EOF) to the program, causing it to exit cleanly. If your program hangs, use Ctrl-C (or Cmd-C) to force-quit the program.

Testing

We strongly recommend you test your code often while implementing, by compiling and running a server with a few clients (as shown above).

To run tests, you must be in the build/ directory. Then, run make check.
To run the tests for an individual step # , you can run make check A#. For example, to run the tests for step 1, you can run make check A1.

The tests are in the tests/ directory, with each subdirectory corresponding to a part of the assignment. After implementing each step, the testing harness will run all tests related to that step. Specifically, the steps will run all tests in the following directories:

Step 1: kvstore_sequential_tests (A1)
Step 2: kvstore_parallel_tests (A2)
Step 3: kvstore_sequential_tests (A3)
Step 4: kvstore_parallel_tests(A4) and kvstore_performance_tests (A5)

If you're failing tests, your first step to debug should be to run them individually. Each test prints why it failed, but make check suppresses that output.

To run an individual test, run that specific test's executable. For instance, for test_put_get_delete, run in the build/ directory:

make test_put_get_delete
./test_put_get_delete

If nothing prints, the test passed. Otherwise, you'll see an error message.

Example Error Message

$ ./test_put_get_delete
Assert failed in file test_put_get_delete.cpp on line 26: 
Expected first argument  to equal second argument x1Ha4M0dLCkK

The error message directs us to this assertion:


ASSERT_EQ(get_res.value, vals[i]);

so the error message is telling us that get_res.value (the first argument) is empty, but the test expected it to be equal to the second argument, vals[i], or x1Ha4M0dLCkK.

The correctness tests can also optionally specify which type of KVStore to run (SimpleKvStore or ConcurrentKvStore) by passing in either simple or concurrent as an argument:

./test_put_get_delete simple

The tests default to the ConcurrentKvStore.

If you want to write your own tests, add new files to the subdirectories in tests/. Be sure to take a look at the functions in test_utils/test_utils.hpp –- they will make writing your own tests much easier.

Assignment

Step 1: Single-Threaded Simple KV Store

In this section, you'll implement a single-threaded simple key-value store. By "single-threaded," we mean that only one thread can safely operate on the KVStore at a time. This means that for Step 1, you should not worry about concurrency. You may safely assume that only one thread will ever be operating on your KVStore at a time.

Take a look at how we define a SimpleKvStore (in kvstore/simple_kvstore.hpp):

class SimpleKvStore : public KvStore {
 public:
    SimpleKvStore() = default;
    ~SimpleKvStore() = default;

    bool Get(const GetRequest *req, GetResponse *res) override;
    bool Put(const PutRequest *req, PutResponse *) override;
    bool Append(const AppendRequest *req, AppendResponse *) override;
    bool Delete(const DeleteRequest *req, DeleteResponse *res) override;
    bool MultiGet(const MultiGetRequest *req, MultiGetResponse *res) override;
    bool MultiPut(const MultiPutRequest *req, MultiPutResponse *) override;

    std::vector<std::string> AllKeys() override;

 private:
    // Your internal key-value store implementation!
};

Task: Define your internal key-value store. What data structure might be useful here? You may find this reference page useful.

Now it's time to implement each of these methods. First, look at how we define the Request and Response structs in net/server_commands.hpp. The Request object contains the information being requested. The Response object is where your methods should store the response to the request (if a response is needed). For the AllKeys operation, return a vector of the keys. For the other operations, return true if the operation is successful; otherwise, return false.

Give me an example!

A call to Get(const GetRequest *req, GetResponse *res) takes in a pointer to a GetRequest, which has a key field. Your method should get the value corresponding to key from your internal key-value store and store it in the value field of the GetResponse that res points to.

If a Response struct is empty, that means you don't need to provide an explicit confirmation that you fulfilled the request. For example, for a call to Put(const PutRequest *req, PutResponse *res), you should insert the key and value from the PutRequest that req points to into your internal key-value store. Since PutResponse is an empty struct, you don't need to do anything with it.

Your key-value store will support the following operations:

Function	Arguments	Summary
`Get`	`<key>`	Gets the value associated with `<key>` from the key-value store. Fails if it does not exist.
`Put`	`<key>`, `<value>`	Sets `<key>` to `<value>`; replaces existing values.
`Append`	`<key>`, `<value>`	Appends `<value>` to the current value associated with `<key>`. Creates a new key-value pair if it does not already exist.
`Delete`	`<key>`	Deletes `<key>` from the key-value store, returning the last saved value. Fails if it does not exist.
`MultiGet`	`<key_1> ... <key_n>`	Gets the values associated with keys `<key_1> ... <key_n>` from the key-value store. Fails if any key does not exist.
`MultiPut`	`<key_1> ... <key_n>, <value_1> ... <value_n>`	Sets `<key_1> ... <key_n>` to `<value_1> ... <value_n>`, respectively.
`AllKeys`	None	Returns all of the keys from the database.

Task: Implement the methods in kvstore/simple_kvstore.cpp.

When you are finished, you should pass all kvstore_sequential_tests. You can check that your implementation is correct by running in the build/ directory:

$ make check A1

If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):

$ make test_name
$ ./test_name simple

The tests default to using the ConcurrentKvStore (which you'll implement in Steps 3 and 4), so you'll need to explicitly tell each test to use the SimpleKvStore by passing in simple as an argument.

Step 2: Concurrent Simple KV Store

We would like our server to be able to handle multiple requests in parallel. However, our simple key-value store can't safely handle concurrent requests yet! In this step, you'll make your simple key-value store thread-safe.

Implementation

Task: Add a synchronization primitive to SimpleKvStore in kvstore/simple_kvstore.hpp to make it thread-safe. You should use at most one mutex in this part. (We will explore more efficient locking schemes in later steps.)

Task: Modify the methods in kvstore/simple_kvstore.cpp to enforce synchronized access, using the primitive you defined in the previous task.

When you are finished, you should pass all kvstore_parallel_tests. You can check that your implementation is correct by running in the build/ directory:

$ make check A2

If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):

make test_name
./test_name simple

Step 3: Single-Threaded Bucket-Based KV Store

Overview

Operations on your simple key-value store are thread-safe and concurrent, but there is no actual parallelism. Since two threads cannot modify the store at the same time, it is thread-safe. However, the entire map gets locked every time a thread wants to perform an operation on the store, which means that there are no parallel (i.e. simultaneous) operations.

This single mutex protection strategy is an example of coarse-grained locking: one lock protects access to a large section of code or an entire data structure. Fine-grained locking attempts to reduce lock contention by splitting the data structure or code into smaller, potentially disjoint pieces that each have their own lock. Multiple threads can then operate on separate pieces in parallel.

However, we have to be careful that a fine-grained locking scheme doesn't introduce subtle concurrency bugs. Also, having too many mutexes can introduce overhead that offsets the benefits of the fine-grained locking strategy.

Example of Coarse vs. Fine-Grained locking

A data structure you could implement coarse or fine-grained locking would be an array.

Coarse-grained locking would imply locking the entire array when any of it needs to be modified. This means that even if only a small part of the array will be modified, no other threads can access any part of the array.

This would look something like having an array-specific mutex.











// This may appear somewhat tedious
my_array_mtx.lock();
my_array[0] = 123;
my_array_mtx.unlock();
    
// But this locking is a lot easier on the computer
my_array_mtx.lock();
for(int i = 0; i < sizeof(my_array) / sizeof(int); i++){
   my_array[i] = i + 3;                                
}
my_array_mtx.unlock();

Alternatively, you could have fine-grained locking. This would involve ensuring that, instead of the entire array being unavailable, only the specific components changed will be unavailable to other threads. Thus, every "slot" in the array would need a corresponding mutex. While this allows for greater flexibility for threads, this structure may be a lot heavier on the system (because one would need to create more mutexes).

This would look something like:











// This is a lot simpler than before
my_array_mtx_array[0].lock();
my_array[0] = 123;
my_array_mtx_array[0].unlock();
    
// But this is a lot more tedious
for(int i = 0; i < sizeof(my_array) / sizeof(int); i++){
    my_array_mtx_array[i].lock();
    my_array[i] = i + 3;  
    my_array_mtx_array[i].unlock();
}

Notice in the above example, the scalability isn't too great either – this is because we chose a too-fine grain 🌾 Thus, it's always important to find a balance between approaches.

Another very cool example of fine and coarse-grained locking is locking a linked list! Feel free to examine this link for more information.

Implementation

If the single mutex locking strategy from Step 2 is too coarse-grained, and one lock for each key-value pair is too fine-grained, how can we design an efficient and scalable KVStore? We need to revisit our internal store design to support a new locking scheme–one where there's more than one lock, but fewer locks than the number of keys.

For this part, you will implement the key-value store API as a bucket-based hash table.

Take a look at how we define a ConcurrentKvStore (in kvstore/concurrent_kvstore.hpp):



















class ConcurrentKvStore : public KvStore {
 public:
    ConcurrentKvStore() = default;
    ~ConcurrentKvStore() = default;

    bool Get(const GetRequest *req, GetResponse *res) override;
    bool Put(const PutRequest *req, PutResponse *res) override;
    bool Append(const AppendRequest *req, AppendResponse *res) override;
    bool Delete(const DeleteRequest *req, DeleteResponse *res) override;
    bool MultiGet(const MultiGetRequest *req, MultiGetResponse *res) override;
    bool MultiPut(const MultiPutRequest *req, MultiPutResponse *res) override;

    std::vector<std::string> AllKeys() override;

 private:
    // Your internal key-value store implementation!
    DbMap store;
};

Notice that your concurrent key-value store will support the same operations as the SimpleKvStore, so refer to Step 1 for descriptions of each method. Your internal store is defined as type DbMap. We have already implemented the DbMap struct for you. Your responsibility is to write the methods in kvstore/concurrent_kvstore.cpp.

For this part, you may assume that only one thread will operate on your bucket-based KVStore at a time. We will add support for synchronized access in Step 4.

You are required to keep your internal store as type DbMap, and to maintain the internal structure of the DbMap (i.e., no changing this line):

std::array<std::list<DbItem>, BUCKET_COUNT> buckets;

However, you are not required to use getIfExists, insertItem, or removeItem if you don't want to. We provide them because we think they will be helpful helper methods for you to work with your internal store, but if you'd rather write different ones (or do all of your work in kvstore/concurrent_kvstore.cpp), feel free.

Task: Use the given DbMap implementation to fill in the function bodies for Get, Put, Append, Delete, MultiGet, MultiPut, and AllKeys in kvstore/concurrent_kvstore.cpp.

When you are finished, you should pass all kvstore_sequential_tests. You can check that your implementation is correct by running in the build/ directory:

make check A3

If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):

make test_name
./test_name

For Steps 3 and 4, you can either specify concurrent as an argument or just leave it blank (as in the example above).

Step 4: Concurrent Bucket-Based KV Store

Overview

In this part, you'll modify your bucket-based key-value store implementation to support concurrent and parallel access using readers-writer locks.

Readers-writer locks

Recall from class the basic rule of synchronization:

If two or more threads can concurrently access an object, and at least one of the accesses is a write, a race condition can occur and synchronization is required.

Observe that synchronization is only required if at least one thread is writing to shared data. If both threads are reading, then they can safely read concurrently. Regular mutexes, however, do not address this distinction–only one thread can access a critical section at a time, so such concurrent reads are impossible.

One potential optimization is a readers-writer lock, or RWLock (std::shared_mutex in C++). RWLocks allow for concurrent access for read-only operations, but exclusive access for write operations. Before entering a critical section, threads may attempt to acquire either a shared (or read) lock or an exclusive (or write) lock. Multiple threads may hold a shared lock at once, but like regular mutexes, only one thread may hold an exclusive lock. To ensure safe concurrent or exclusive access, reader threads cannot acquire a shared lock until there is no writer thread holding an exclusive lock.

As with mutexes, it is up to the programmer to properly protect shared data and lock critical sections within the program. RWLocks present an additional challenge over mutexes: programmers must also ensure that threads acquire the correct type of lock before entering a critical section or potentially modifying data. For example, acquiring a shared lock and then modifying shared data is undefined behavior. As you implement key-value store functions, keep in mind which functions are amenable to concurrent access, and which require exclusive access.

Example of RWLocks

To acquire exclusive access to a critical section, use the lock and unlock functions. To acquire shared access, use lock_shared and unlock_shared:

shared_mtx.lock_shared(); // acquire/release a shared lock
/*
 * Do work here. Do not modify any shared data in this critical section!  
 */
shared_mtx.unlock_shared();

shared_mtx.lock(); // acquire/release an exclusive lock
/*
 * Do work here. You can access and modify shared data here.  
 */
shared_mtx.unlock();

See the std::shared_mutex class for complete documentation.

In addition, like regular mutexes, C++ provides a wrapper class, std::shared_lock, that automatically unlocks the shared mutex once it goes out of scope. This has the same API as std::unique_lock:

void foo() {
    // On initialization, a shared lock is acquired on `shared_mtx`.
    std::shared_lock guard(shared_mtx);

    /*
     * Do work here. Do not modify any shared data here!
     * 
     * On function exit (where `guard` goes out of scope), the shared lock is automatically released.
     */
}

Optional: Why not just use RWLocks all the time?

Especially in applications that experience read-heavy workloads, RWLocks permit more concurrency over regular mutexes. However, RWLocks should not be used as a drop-in replacement for mutexes: RWLocks come with additional overhead over simple mutexes, as they must contain additional state and logic to handle readers and writers. RWLocks also must avoid reader or writer starvation, where one operation is disproportionately favored. Wikipedia's article provides an introduction to potential RWLock implementations and problems.

Task: Add RWLocks to DbMap to synchronize access to your buckets.

You may not use a single, global mutex. If you find yourself writing code like this:

global_mtx.lock();
// < your code here >
global_mtx.unlock();

you are doing something wrong. You will not pass the performance tests, and may face further deductions during grading. The purpose of this step is to support thread-safe, parallel access to the store–a single lock prevents parallelism!

Hints

Think about a more fine-grained approach that makes concurrent access to your store more efficient! See Part 3 for an explanation of coarse-grained vs. fine-grained locking. How can you use multiple RW locks such that multiple threads can operate on your store in parallel?

Synchronization Considerations

Now, it's time to use the primitives you defined to synchronize access to your key-value store. Here are two hints before you get started:

Enforce a consistent locking order. For example, if a given operation requires lock1 and lock2, you should enforce that every operation always attempts to lock lock1 before lock2 (or vice versa). Think about why this strategy is necessary to prevent deadlock!
Acquire all relevant locks before accessing or modifying the store. We want each operation in our key-value store to be atomic, meaning that they are each executed as a single, indivisible unit. In this context of a key-value store, other operations should not modify relevant buckets while the MultiGet, MultiPut, and AllKeys operations are running.

Consider two concurrent operations: one MultiGet that requires lock1 and lock2, and one Delete that requires lock1. Imagine the following execution order:
- The MultiGet acquires lock1, gets value1 associated with key1, then releases lock1.
- The Delete acquires lock1, deletes the key1: value1 entry, then releases lock1.
- The MultiGet acquires lock2, gets some values, than releases lock2.
When the MultiGet returns, it will returns a value (value1) that is no longer in the store. We can prevent this bug by ensuring that the MultiGet owns both lock1 and lock2 before proceeding, and does not release either until it has finished entirely.

Task: Modify the methods in kvstore/concurrent_kvstore.cpp to enforce synchronized access, using the primitives you defined in the previous task.

When you are finished, you should pass all kvstore_parallel_tests. You can check that your implementation is correct by running in the build/ directory:

$ make check A4

If you implemented fine-grained locking correctly, you should also pass the performance tests. You can run the performance tests by running in the build/ directory:

$ make perf

Your multi-threaded solution with fine-grained locking needs to be at least 2x faster than (i.e., achieve twice the throughput of) the single-threaded, sequential implementation.

At this point, your implementation should pass all of the Part A tests! You can test this by running in the build/ directory:

$ ./run_tests concurrent_store

$ make check concurrent_store

Debugging with Sanitizers and GDB

Debugging concurrent code can be trickier than debugging single-threaded code. As threads interact with the shared data structures, it can be difficult to track down data races and deadlocks. Luckily, GDB and thread sanitizers can be really helpful.

Thread Sanitizers

Google’s ThreadSanitizer for C and C++ can help you detect data races, deadlocks, and thread leaks. To run the code with thread sanitizers, compile it with:

make TSAN=1

When compiling tests, you will need to specify TSAN=1 too:

make check TSAN=1

./test.sh TSAN=1 <A1|A2|A3|A4|A5>

Example

Consider the following example:

int count = 0;

int main() {
    // Start child thread, which calls child()
    std::thread p1(child);
    std::thread p2(child);
    // Only parent executes here
    p1.join();
    p2.join();

    printf("count %d\n", count); 
}

int child() {
    // Child executes here
    while(count < 100) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        count += 1;
    }
    return 0;
}

When you run this code using a thread sanitizer, it will produce the following output:

Note: Open the image in a new tab for a larger image.

The red message indicates what error is occurring (in this case, a data race)
The blue messages indicate conflicting memory accesses with a stack trace. In our case, the conflict occurs on line 25 and line 27. It includes which thread is performing each access, and if the thread holds a mutex it will also indicate it.
Below these messages, the sanitizer might also indicate:
- specific mutexes, when they were aquired, which threads acquired them, and sometimes when they were created
- threads, and when they were created
Refer to this ThreadSanitizerReportFormat for more information on the report format.

GDB

Sometimes if your program hangs due to a deadlock, the thread sanitizer may not be able to return any reports. In these cases, gdb has some more helpful tools. Here are some useful gdb commands:

info threads: prints out existing threads, the Light Weight Processor (LWP) they’re running on, and the line of code they’re executing. The current thread will have an asterisk next to it.

3    Thread 0x7ffff3bff700 (LWP 9167) "example" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
4    Thread 0x7ffff33fe700 (LWP 9003) "example" __lll_lock_wait ()
    at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

thread <thread-id>: switches to thread <thread-id>

(gdb) thread 4
[Switching to thread 4 (Thread 0x7ffff33fe700 (LWP 9003))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135	../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S

p <mutex>: Prints the fields of a mutex. Helpful for determining the owner of the lock identified by its Light Weight Processor.
CTRL+Z: Stops the execution of the program. If your program is hanging likely due to a deadlock, this allows you to examine the current state of the program. For example, if one of your threads is blocked on a mutex, you can switch to that thread, see which thread owns the mutex, and then switch to that thread.
bt: Prints out a stacktrace. Often times, threads are blocked on a system call (locking a mutex, waiting on a condition variable, or reading/writing to a file descriptor). Backtrace is useful for discovering where in the codebase the blocking call was called.
f <frame-number>: Switches to the provided frame number. Allows you to switch to frames within files you’re familiar with and determine which function called blocking system calls.

Tweeter

Context

Now that you've implemented a simple key-value store, let's build upon it with a use case!

In this part of the assignment, you’ll be working with a specialized and more complicated key-value store for a new social media platform, Tweeter. On Tweeter, users can choose their usernames, write posts that appear on their profiles, and respond to other users’ posts (which appear on both users’ profiles).

In Tweeter’s database, there are five kinds of key-value pairs.

Key Value Pair Structure	Example Return Value
`user_id` → username	“user_14” → “malte”
`post_id` → post content	“post_59” → “Hello, Tweeter!”
`all_users` → comma-separated list of `user_ids`	“all_users” → “user_13,user_14,user_160”
`user_id_posts` → comma-separated list of `post_ids` that user has posted	“user_14_posts” → “post_59,post_1”
`post_id_replies` → comma-separated list of `post_ids` that respond to post	“post_59_replies” → “post_60, post_61”

Even though there can be multiple users with the same usernames, every user_id is unique. The same goes for posts: even though there can be multiple posts with the same text, every post has a unique post_id.

Loading your KVStore with data

We provide you with the data stored on Tweeter's instance of KVStore here. The same data is also in the gdpr/database.csv, and you can load it into your KVStore as follows.

In one terminal, run this command from the build directory to start a KVStore server listening on port 1234 (you can pick any number for the port, it just sets up a rendevous point for your client to connect with the server):

build$ ./server 1234

In another terminal, run the client and feed in the data:

build$ ./simple_client 127.0.0.1:1234 < ../gdpr/database.csv

After running this command, typing print store into the first terminal (which runs the server) should show that your KVStore contains our dataset. Note that since the KVStore is an in-memory store, you will need to re-load the dataset every time you restart the server, as it loses its contents on shutdown.

Now, a client can connect to the server and make API requests; for instance, to fetch user_1's data from the server, start up a new client and make a Get request:

build$ ./simple_client 127.0.0.1:1234
get user_1

Regulatory Background

When implementing a real-world storage system handling personal data like Tweeter, it is essential to understand the legal frameworks that govern data privacy. In this section of the assignment, you will implement features pertaining to privacy regulations, with a particular focus on the European Union's General Data Protection Regulation (GDPR). This regulation is designed to ensure that users have agency over their data, including the right to both access and delete their information online.

Task: Please read the specific requirements of the GDPR at the following links.

What about the United States?

The US is moving slower in data protection laws than the EU, but some states have created GDPR-like regulations such as the California Consumer Privacy Act(CCPA).

You can read about its equivalent policies here:

In a nutshell… 🥜

Right to Access

GDPR Right of Access (Article 15) – Users can receive a copy of their own data upon request, as well as the source of any data that was not collected from the subject. When relevant, they can also request: the reason their data has been processed, the categories of data being processed, who can access this data, and the duration for which their data will be stored. They must also be informed when their data is transferred to a third party.

And the United States?

CCPA Right to Know – Users can request access to personal data from companies, as well as: the categories of their data, the categories of sources, the purpose of collecting that data, and information about what categories of data are being transferred to what types of third parties.

What’s the difference?

Not much – for our application they share broadly similar data access rights!

Right to Delete:

GDPR Right to be Forgotten (Article 17) – Users can request the deletion of the personal data under six specific circumstances, including when it is no longer needed for the purpose it was collected, or the user has revoked consent. Companies are not required to comply if the data is essential for: avoiding infringing upon freedom of expression, complying with legal requirements, engaging in scientific or historical research, or reasons in the interest of public health.

And the United States?

CCPA Right to Delete – Users can request the deletion of all personal data at any time. Companies are not required to comply if the data is essential for: fulfilling the purpose for which it was collected, ensuring security and integrity, identifying and repairing bugs in functionality, avoiding infringing upon the rights of another consumer (e.g. to free speech), engaging in scientific, historical, or statistical research, “enabling solely internal uses that are reasonably aligned with the expectations of the consumer,” or complying with legal requirements.

What’s the difference?

The primary difference is that “the GDPR right only applies if the request meets one of six specific conditions while the CCPA right is broad. However, the CCPA also allows business to refuse the request on much broader grounds than the GDPR” (Practical Law).

Stakeholders

Tweeters has users who are in the EU, so Tweeter must comply with the GDPR’s right to access and the right to be forgotten. You are in charge of making a decision on how to handle a particular user's request to exercise their right to access and their right to be forgotten.

Task: Check EdStem! You will find a pinned post containing information about which stakeholder pair (or pairs) you have been assigned to complete this portion of the assignment.

If you have been assigned two stakeholder pairs, please use the first one for all parts of the assignment. You will use the second pair to answer question 3B later!

You will write a function GDPRDelete() that performs a delete request for your assigned stakeholder. There are several ways one might implement privacy-conscious deletion (think about what we discussed in section for ideas!). While many different kinds of delete can be acceptable for your stakeholder pair, you cannot opt out of implementing some form of delete and reject their request, although your implementation may not meet all of their hopes and expectations. (In the real world, there are cases where deletion requests have been rejected completely, but for the scope of this assignment, we’ve selected the dataset and the stakeholders such that you should be able to implement some kind of delete, even if it is an unsatisfactory one for your stakeholder.)

The caveat with this particular assignment is that you are also accountable to an ethical auditor who is charged with scrutinizing your company’s handling of GDPR compliance. Therefore, when satisfying your stakeholder’s request for deletion in this assignment, you must also consider other stakeholders that could be affected by this deletion and whose legitimate claims and interests might be infringed upon.

The following introduces each stakeholder pair and explains the context in which the deletion request occurs:

PAIR #1:

Data Subject: Congressperson Kirby — Congressperson Kirby is currently the elected congressperson for Rhode Island’s first district and is running for reelection in the current election cycle. Recently, some of their tweets from when they were in college have resurfaced. These tweets were written 10 years ago concerning a pandemic that occurred at the time. While the congressperson has issued apologies, this has not been enough to stop the ongoing discourse from users all across the political spectrum. Because of this, the congressperson has made a request to exercise their right to be forgotten. They wish to delete their account, which they hope will delete their original tweets and related public discourse (other users quoting or rewording the original posts) about the controversy.
Opposing stakeholder: Freedom House Advocacy Group — This advocacy group is concerned with government accountability & transparency. They oppose Kirby’s attempt to remove evidence of their controversy from the past.

PAIR #2:

Data Subject: Sarsra Breisand — Sarsra Breisand is an American singer, actress and director. With a career spanning over six decades, she has achieved success in multiple fields of entertainment, and is among the few performers awarded an Emmy, Grammy, Oscar, and Tony (EGOT). Recently, a paparazzi reporter leaked information about where Sarsra Breisand lives. Despite her attempts to hide this information, she draws more attention to this leaked information. As a desperate final attempt to protect her privacy, Sarsra has made a request to exercise her right to be forgotten and delete her account and all posts related to her. Sarsra understands that this will erase her social media profile, but hopes that this will put an end to the interest in her whereabouts.
Opposing stakeholder: Beth Abraham — Beth Abraham is a historian that has begun studying what she has named the Breisand Effect. She has released several research papers on this psychological and sociological phenomenon and is in the process of writing another.

PAIR #3:

Data Subject: Angel Yoline — Angel Yoline is an up-and-coming actress who is excitedly promoting her new film on Tweeter. However, a while ago she posted views on Tweeter that characterize working class people in a negative light. Her former partner, Brad Schmidt, has recently higlighted these problematic views as a means of getting revenge on Angel. She now wishes for her own post, Brad’s post, and the public discourse about the controversy to be deleted.
Opposing stakeholder: Film Enthusiasts — Film enthusiasts have been excited about the new film that Angel Yoline is starring in. When an account exposes something that Angel Yoline said about working class people, there is a surge of new posts discussing her controversial opinions.

PAIR #4:

Data Subject: Frank Blimp — Frank Blimp is a recent divorcee who doesn’t have custody of his child from the marriage. Falling into hard times, Frank missed a couple of child support payments, which his ex-wife, Marge, has called him out on Tweeter. Even though both Frank and Marge have private accounts, these posts are visible to Frank’s friends and family. Since this tweet, Frank has been able to repay the missed payments and is financially stable enough to continue making future payments. He’s asked Marge to take down her tweets shaming him, but she has refused. Frank now requests his data to be erased in the hope that this will also make posts mentioning him disappear.
Opposing stakeholder: Marge Blimp — Marge is Frank’s ex-wife who has main custody of their child. After he had missed multiple child support payments, she resorted to using Tweeter to call him out. She wants others to be aware of Frank's past behavior.

PAIR #5:

Data Subject: Matt Bleat — Matt Bleat is a senior in high school who has just committed to Brown University after being recruited to their track and field team. Recently, there have been posts from Matt’s high school classmates calling attention to former allegations against Matt for sexual harassment. The high school determined that there was insufficient evidence to support those allegations and declined to take further action. Matt invokes the right to erasure, expecting that it will remove his posts, but also remove posts that mention hashtags related to the controversy.
Opposing stakeholder: The Center for Changing Our Campus Culture — They believe that posts highlighting the allegations against Matt should remain on Tweeter, and Matt should be prepared to deal with the possibility that Brown could discover these tweets and rescind their offer of admission.

The `GDPRDelete()` Function

The signature of GDPRDelete is as follows:

bool GDPRDelete(std::string& user_id);

In other words, the function receives a single argument, which is the user ID of the data subject who is invoking the right to erasure (i.e., your data subject stakeholder). The function can use this argument to look up data related to the data subject in the KVStore, or to find the user's identifier (e.g., "user_1") in other data.

In implementing GDPRDelete() you can consider a number of broad options for data store modification including, but not limited to:

Complete removal of data.
Modification of data (e.g. anonymization, modification, removal of parts of the content).
Making data harder to find, i.e. delisting or downranking (what might this mean in the context of this KVStore?).
Adding additional keys to the store.

Your GDPRDelete() is allowed to leave the KVStore in an inconsistent state (e.g., the key "user_1" is deleted from the store, but remains in the "all_users" list), but if this is the case, this should be clearly noted and justified.

Note that the function does not receive information about the context in which the deletion happens (e.g., what other users' data the data subject might want to delete, or what the other users' views on this are). Your design could extend the KVStore with auxiliary metadata that captures relevant context (e.g., special key-value pairs that indicate users who are of special interest, such as public figures), and GDPRDelete() may draw on this data. Using such metadata is not a requirement, however.

We will not grade you on the level of sophistication your GDPRDelete() function achieves, but rather on whether it works. As long as your written answers justify your choices, you will get full credit, even if the function itself is simple.

Before you write any code…

You might feel overwhelmed with the situation you’ve been presented with — we’ve engineered each situation such that the conflict is intentionally difficult to deal with. Because of this, we want you to consider the following questions before you touch any of the code. You don’t have to submit your answers for this, but we highly recommend writing down some notes for yourself.

Consider why the data subject would want their data (and potentially the others' data) to be deleted. What kind of ethically significant claims would support or reject their request?
Consider why the opposing stakeholder wouldn't want data to be deleted. What kind of ethically significant claims would support or reject their request?
What strategies or mechanisms could you propose to mitigate potential harm to the opposing stakeholder while still respecting the data subject's right to be forgotten?

Important note: we don’t expect you to find a solution that satisfies everyone — rather, the point is to make the most reasonable tradeoffs between the opposing parties’ claims.

Implementation Tasks

Task 1: Implement GDPRDelete() in client/simple_client.cpp.
Task 2: Leave detailed (header and/or inline) comments in your code that explain what kind of delete you are implementing.

GDPRDelete() should operate on any keys (such as user_ID, post_ID, etc) you deem necessary to implement your chosen strategy of deletion.

To test your GDPRDelete() function, you can use the gdprdelete <user> command in the client as follows:

$ ./simple_client 127.0.0.1:1234
gdprdelete user_10

You can then use print store in the server to see how your store contents have changed.

(Go back to Loading your KVStore with data if you need a reminder on how to start up your KVStore.)

After you’ve finished implementing delete…

The right to know and the right to delete affect all stakeholders of a database: its users, its operators, and those who use the data for technology, studies, and historical records. Now that you have implemented your version of privacy-conscious access and deletion, you should consider the potential pitfalls of your design choices in practice.

Task: Answer the following questions in your README file:

You may want to read the "How will you be graded" section below before beginning this task to understand what we are looking for!

Q1. Who was your stakeholder pair? (1 sentence)

Q2. What kind of delete did you implement and why? Explain your decisions as if you were reporting to an ethical auditor who will determine whether your design is justified. Highlight how your decisions address stakeholder concerns, and explain why you addressed their concerns in this way. If you purposely ignored any stakeholder concerns in your implementation, explain why they were not addressed (1-2 short paragraphs)

Q3. In the real world, Twitter, Google, Airbnb, and other companies write one implementation of GDPR-compliant account removal to realize the right to erasure. They do not write per-user functions! In the next part of the assignment, you'll think about how to generalize your implementation.

Answer the applicable question based on whether we assigned you one or two stakeholder pairs. Please mark which version of the question you are answering in your README. (1-3 short paragraphs)

Q3A. If we assigned you only one stakeholder pair: What are the shortcomings of your implementation? How might your approach to this assignment change if you were asked to create a general GDPRDelete() suitable for Right to be Forgotten requests from all of Tweeter's users? Consider specifically some other short and/or long term stakeholders within your scenario for whom your implementation might not be suitable.
Q3B. If you were assigned two stakeholder pairs: What are the shortcomings of your implementation? How might your approach to this assignment change if you were asked to create a general GDPRDelete() suitable for Right to be Forgotten requests from all of Tweeter's users? Consider specifically the concerns of your second pair of stakeholders, and whether your implementation of GDPRDelete() would be suitable for these stakeholders as well.

Along with considering additional stakeholders, you can also consider other outside entities who might be affected by your implementation of GDPRDelete().

How will you be graded?

Your implementation needs to work and do what you indicate it is doing in your comments.
You’re not being graded on whether you picked the “right” answer — there is no one correct version of delete for each stakeholder pair. Rather, the bulk of your grade is determined by how well you explain why you chose your specific implementation: the justifications you offer in support of your implementation should reflect a comprehensive assessment of the context and competing claims and weigh them against each other in a nuanced way.

A good response identifies the legitimate claims that each stakeholder may have, explains why those claims are important, and compares the importance of both claims. It provides concrete reasons for (fully or partially) prioritizing or rejecting individual stakeholders’ claims and how those trade-offs are reflected in the chosen implementation of the right to be forgotten. A good response, importantly, also touches upon the limitations of those choices.

Your response should show that you have thought about the interests of your stakeholders beyond the directly legal implications. That is, you must use some justification outside of the GDPR itself for the design of your implementation. Because GDPR requirements are defined rather loosely, you will have naturally made decisions beyond those required legally; a good response will highlight some of these decisions. You may also think about some of the values that motivate the GDPR requirements that you have drawn upon.

Handing In and Grading

Handin Instructions

As before, you will hand in your code using Git. In the kvstore/ subdirectory of your projects repository, you MUST fill in the text file called README-5A.md.

By 6:00pm on Friday, April 26th, you must have a commit with all steps completed.

Remind me again what the README should contain?

The README.md file will include the following:

Any design decisions you made and comments for graders, under “Design Overview”. If there’s nothing interesting to say, just list “None”.
Any collaborators and citations for help that you received, under "Collaborators". CS 300 encourages collaboration, but we expect you to acknowledge help you received, as is standard academic practice.
Your answers to the questions regarding your GDPRDelete() implementation.
Ranking the difficulty and time consumption of the project, compared to other projects so far. We won't use this for grading, but we collect the data to calibrate for future years.

Grading Breakdown

This project is graded out of 140 points, but will be weighted equally with the other projects in the course.

For the KVStore implementation parts:

18% (18 points) for passing the A1 tests (2 pts per test).
7% (7 points) for passing the A2 tests (1 pt per test).
40% (40 points) for passing the A3 tests (4 pts per test, and 4 pts for passing them all).
21% (21 points) for passing the A4 tests (3 pts per test).
14% (14 points) for passing the A5 (performance) tests (7 pts per test).

For the Tweeter part of the project:

25% (10 points) for your implementation of GDPRDelete.
25% (10 points) for comments and explanation of how your deletion function works.
50% (20 points) for answers to the four README questions.

Important note on using late hours:

You will be doing Project 5B in the same directory as Project 5A. We will set everyone's grading commit at the normal project deadline so that your Project 5B work won't incur late hours for your Project 5A submission.

However, if you are using late hours, please flag the commit you want graded so that we grade the right commit and assign the right number of late hours to Project 5A.

Now head to the grading server, make sure that you have the “KVStore” page configured correctly with your project repository, and check that your KVStore runs on the grading server as expected.

Congratulations, you’ve completed the first part of the fifth CS 300 project!

Acknowledgements: This project was developed for CS 300 by Richard Tang, Carolyn Zech, Cedric Sirianni, and Malte Schwarzkopf. The Tweeter portion of this project was developed for CS 300 by Eva Schiller, Eva Lau, Colton Rusch, and Malte Schwarzkopf.

Project 5A: Concurrent Store 🗂🗃🗄

Introduction

Learning Objectives

Assignment Installation

Stencil and Initial State

Compile and Run

Using the Client

Testing

Assignment

Step 1: Single-Threaded Simple KV Store

Step 2: Concurrent Simple KV Store

Implementation

Step 3: Single-Threaded Bucket-Based KV Store

Overview

Implementation

Step 4: Concurrent Bucket-Based KV Store

Overview

Readers-writer locks

Synchronization Considerations

Debugging with Sanitizers and GDB

Thread Sanitizers

GDB

Tweeter

Context

Loading your KVStore with data

Regulatory Background

Right to Access

What’s the difference?

Right to Delete:

What’s the difference?

Stakeholders

The GDPRDelete() Function

Before you write any code…

After you’ve finished implementing delete…

How will you be graded?

Handing In and Grading

Handin Instructions

Grading Breakdown

The `GDPRDelete()` Function