« Back to the main CS 0300 website
Project 5A: Concurrent Store 🗂🗃🗄
Due: Friday, April 26 at 6:00 PM EST
Introduction
A key-value store is a storage system in which data (a value) is addressed by a key; common examples are Python's dictionaries or Java/C++ hash maps. Key-value stores are popular for large-scale web applications because they are fast and scalable.
Clients interface with the key-value store through a server. Once client(s) connect to the server, they can issue requests to get, put, append, and delete key-value pairs from the store.
In this project, you'll implement two kinds of key-value stores. You'll then extend your implementation to support concurrent client requests.
Production applications often handle so much data (and so many client connections) that one key-value store is not enough – they need to divide (i.e., shard) their data between multiple stores. In Project 5B, you'll extend your implementation to support such a distributed system.
Learning Objectives
This assignment will help you:
- Familiarize yourself with synchronization primitives, namely mutexes.
- Think about problems introduced by concurrency, like deadlocks and race conditions on shared data structures.
- Learn about key networking abstractions, the "client" and "server," and the roles of each.
- Understand the implications of privacy laws for storage systems, and how developers need to implement technical functionality to give users rights over their data.
- Reason about trade-offs between stakeholder interests and public goods relating to privacy and data protection.
Assignment Installation
Ensure that your project repository has a handout remote. Type:
$ git remote show handout
If this reports an error, run:
$ git remote add handout https://github.com/csci0300/cs300-s24-projects.git
Then run:
$ git pull
$ git pull handout main
This will merge our Project 5 stencil code with your repository.
Once you have a local working copy of the repository that is up to date with our stencils, you are good to proceed.
Infrastructure Help
Stencil and Initial State
Project 5B also uses this stencil, so there is more code in the repository than you will need for this assignment. These are the directories you'll need for this assigment:
Directories |
Contents |
build/ |
See "Compile and Run". |
tests/ and test_utils/ |
See "Testing". |
client/ |
The simple_client implementation you'll run in this assignment, along with the commands it supports. Also has the stencil for shardkv_client , which you'll implement in Project 5B. |
gdpr/ |
The database.csv file you'll work with in the Tweeter portion of the assignment |
kvstore/ |
The definition of the abstract KvStore interface, as well as the SimpleKvStore and ConcurrentKvStore classes. |
server/ |
The definition of the KvServer class and support code. |
If you're curious, descriptions of the rest of the directories are below:
Directories |
Contents |
cmd/ |
The runner code that starts the client, server, and shardcontroller. You should not modify these. |
net/ |
The networking support code used by the clients and servers. You should not modify these. |
repl/ |
Utilities for running the client, server, and shardcontroller on the command line. You should not modify these. |
common/ |
The Shard class and some helpers for managing or operating on them. Also contains miscellaneous utility functions, such as color printing. These will both be useful in Project 5B. |
shardcontroller/ |
The definition of the shardcontroller class, and support code to process client requests. You'll implement the shardcontroller methods in Project 5B. |
Compile and Run
To compile, cd
into the build/
directory and run make
. This will create two executables: simple_client
andserver
.
To run the server, specify the port, and optionally the number of workers:
$ ./server <port> [n_workers]
To run the simple client, specify the server address:
$ ./simple_client <server hostname:port>
Example Usage
To connect multiple clients to a server, start by opening multiple terminals (or use tmux
).
Terminal 1: runs server
on port 5000 with the default number of workers:
$ ./server 5000
Your server will then be listening on <hostname>:5000
; the server's full address will be displayed in the terminal.
Terminal 2: runs simple_client
to connect to the server, using the address displayed above (in this example, the hostname is 65a4af2c28ae
):
$ ./simple_client 65a4af2c28ae:5000
The server defaults to using your SimpleKvStore. To change that, edit the line this->store = std::make_unique<SimpleKvStore>();
in server.cpp
to use a ConcurrentKvStore
.
Using the Client
Once connected to the server, you can use one or more clients to run standard key-value store operations, like get
, put
, append
, and delete
. You can also use two additional requests, multiget
and multiput
(for multiple key-value pairs). For specific usage instructions, type help
in the command prompt.
To exit the client (and any other executable), press Ctrl-D
(or Cmd-D
); this sends End-Of-File
(EOF
) to the program, causing it to exit cleanly. If your program hangs, use Ctrl-C
(or Cmd-C
) to force-quit the program.
Testing
We strongly recommend you test your code often while implementing, by compiling and running a server with a few clients (as shown above).
To run tests, you must be in the build/
directory. Then, run make check
.
To run the tests for an individual step #
, you can run make check A#
. For example, to run the tests for step 1, you can run make check A1
.
The tests are in the tests/
directory, with each subdirectory corresponding to a part of the assignment. After implementing each step, the testing harness will run all tests related to that step. Specifically, the steps will run all tests in the following directories:
- Step 1:
kvstore_sequential_tests
(A1)
- Step 2:
kvstore_parallel_tests
(A2)
- Step 3:
kvstore_sequential_tests
(A3)
- Step 4:
kvstore_parallel_tests
(A4) and kvstore_performance_tests
(A5)
If you're failing tests, your first step to debug should be to run them individually. Each test prints why it failed, but make check
suppresses that output.
To run an individual test, run that specific test's executable. For instance, for test_put_get_delete
, run in the build/
directory:
make test_put_get_delete
./test_put_get_delete
If nothing prints, the test passed. Otherwise, you'll see an error message.
Example Error Message
$ ./test_put_get_delete
Assert failed in file test_put_get_delete.cpp on line 26:
Expected first argument to equal second argument x1Ha4M0dLCkK
The error message directs us to this assertion:
ASSERT_EQ(get_res.value, vals[i]);
so the error message is telling us that get_res.value
(the first argument) is empty, but the test expected it to be equal to the second argument, vals[i]
, or x1Ha4M0dLCkK.
The correctness tests can also optionally specify which type of KVStore to run (SimpleKvStore
or ConcurrentKvStore
) by passing in either simple
or concurrent
as an argument:
./test_put_get_delete simple
The tests default to the ConcurrentKvStore
.
If you want to write your own tests, add new files to the subdirectories in tests/
. Be sure to take a look at the functions in test_utils/test_utils.hpp
–- they will make writing your own tests much easier.
Assignment
Step 1: Single-Threaded Simple KV Store
In this section, you'll implement a single-threaded simple key-value store. By "single-threaded," we mean that only one thread can safely operate on the KVStore at a time. This means that for Step 1, you should not worry about concurrency. You may safely assume that only one thread will ever be operating on your KVStore at a time.
Take a look at how we define a SimpleKvStore
(in kvstore/simple_kvstore.hpp
):
class SimpleKvStore : public KvStore {
public:
SimpleKvStore() = default;
~SimpleKvStore() = default;
bool Get(const GetRequest *req, GetResponse *res) override;
bool Put(const PutRequest *req, PutResponse *) override;
bool Append(const AppendRequest *req, AppendResponse *) override;
bool Delete(const DeleteRequest *req, DeleteResponse *res) override;
bool MultiGet(const MultiGetRequest *req, MultiGetResponse *res) override;
bool MultiPut(const MultiPutRequest *req, MultiPutResponse *) override;
std::vector<std::string> AllKeys() override;
private:
};
Task: Define your internal key-value store. What data structure might be useful here? You may find this reference page useful.
Now it's time to implement each of these methods. First, look at how we define the Request
and Response
structs in net/server_commands.hpp
. The Request
object contains the information being requested. The Response
object is where your methods should store the response to the request (if a response is needed). For the AllKeys
operation, return a vector of the keys. For the other operations, return true
if the operation is successful; otherwise, return false
.
Give me an example!
A call to Get(const GetRequest *req, GetResponse *res)
takes in a pointer to a GetRequest
, which has a key
field. Your method should get the value corresponding to key
from your internal key-value store and store it in the value
field of the GetResponse
that res
points to.
If a Response
struct is empty, that means you don't need to provide an explicit confirmation that you fulfilled the request. For example, for a call to Put(const PutRequest *req, PutResponse *res)
, you should insert the key
and value
from the PutRequest that req
points to into your internal key-value store. Since PutResponse
is an empty struct, you don't need to do anything with it.
Your key-value store will support the following operations:
Function |
Arguments |
Summary |
Get |
<key> |
Gets the value associated with <key> from the key-value store. Fails if it does not exist. |
Put |
<key> , <value> |
Sets <key> to <value> ; replaces existing values. |
Append |
<key> , <value> |
Appends <value> to the current value associated with <key> . Creates a new key-value pair if it does not already exist. |
Delete |
<key> |
Deletes <key> from the key-value store, returning the last saved value. Fails if it does not exist. |
MultiGet |
<key_1> ... <key_n> |
Gets the values associated with keys <key_1> ... <key_n> from the key-value store. Fails if any key does not exist. |
MultiPut |
<key_1> ... <key_n>, <value_1> ... <value_n> |
Sets <key_1> ... <key_n> to <value_1> ... <value_n> , respectively. |
AllKeys |
None |
Returns all of the keys from the database. |
Task: Implement the methods in kvstore/simple_kvstore.cpp
.
When you are finished, you should pass all kvstore_sequential_tests
. You can check that your implementation is correct by running in the build/
directory:
$ make check A1
If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):
$ make test_name
$ ./test_name simple
The tests default to using the ConcurrentKvStore
(which you'll implement in Steps 3 and 4), so you'll need to explicitly tell each test to use the SimpleKvStore
by passing in simple
as an argument.
Step 2: Concurrent Simple KV Store
We would like our server to be able to handle multiple requests in parallel. However, our simple key-value store can't safely handle concurrent requests yet! In this step, you'll make your simple key-value store thread-safe.
Implementation
Task: Add a synchronization primitive to SimpleKvStore
in kvstore/simple_kvstore.hpp
to make it thread-safe. You should use at most one mutex in this part. (We will explore more efficient locking schemes in later steps.)
Task: Modify the methods in kvstore/simple_kvstore.cpp
to enforce synchronized access, using the primitive you defined in the previous task.
When you are finished, you should pass all kvstore_parallel_tests
. You can check that your implementation is correct by running in the build/
directory:
$ make check A2
If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):
make test_name
./test_name simple
Step 3: Single-Threaded Bucket-Based KV Store
Overview
Operations on your simple key-value store are thread-safe and concurrent, but there is no actual parallelism. Since two threads cannot modify the store at the same time, it is thread-safe. However, the entire map gets locked every time a thread wants to perform an operation on the store, which means that there are no parallel (i.e. simultaneous) operations.
This single mutex protection strategy is an example of coarse-grained locking: one lock protects access to a large section of code or an entire data structure. Fine-grained locking attempts to reduce lock contention by splitting the data structure or code into smaller, potentially disjoint pieces that each have their own lock. Multiple threads can then operate on separate pieces in parallel.
However, we have to be careful that a fine-grained locking scheme doesn't introduce subtle concurrency bugs. Also, having too many mutexes can introduce overhead that offsets the benefits of the fine-grained locking strategy.
Example of Coarse vs. Fine-Grained locking
A data structure you could implement coarse or fine-grained locking would be an array.
Coarse-grained locking would imply locking the entire array when any of it needs to be modified. This means that even if only a small part of the array will be modified, no other threads can access any part of the array.
This would look something like having an array-specific mutex.
my_array_mtx.lock();
my_array[0] = 123;
my_array_mtx.unlock();
my_array_mtx.lock();
for(int i = 0; i < sizeof(my_array) / sizeof(int); i++){
my_array[i] = i + 3;
}
my_array_mtx.unlock();
Alternatively, you could have fine-grained locking. This would involve ensuring that, instead of the entire array being unavailable, only the specific components changed will be unavailable to other threads. Thus, every "slot" in the array would need a corresponding mutex. While this allows for greater flexibility for threads, this structure may be a lot heavier on the system (because one would need to create more mutexes).
This would look something like:
my_array_mtx_array[0].lock();
my_array[0] = 123;
my_array_mtx_array[0].unlock();
for(int i = 0; i < sizeof(my_array) / sizeof(int); i++){
my_array_mtx_array[i].lock();
my_array[i] = i + 3;
my_array_mtx_array[i].unlock();
}
Notice in the above example, the scalability isn't too great either – this is because we chose a too-fine grain 🌾 Thus, it's always important to find a balance between approaches.
Another very cool example of fine and coarse-grained locking is locking a linked list! Feel free to examine this link for more information.
Implementation
If the single mutex locking strategy from Step 2 is too coarse-grained, and one lock for each key-value pair is too fine-grained, how can we design an efficient and scalable KVStore? We need to revisit our internal store design to support a new locking scheme–one where there's more than one lock, but fewer locks than the number of keys.
For this part, you will implement the key-value store API as a bucket-based hash table.
Take a look at how we define a ConcurrentKvStore
(in kvstore/concurrent_kvstore.hpp
):
class ConcurrentKvStore : public KvStore {
public:
ConcurrentKvStore() = default;
~ConcurrentKvStore() = default;
bool Get(const GetRequest *req, GetResponse *res) override;
bool Put(const PutRequest *req, PutResponse *res) override;
bool Append(const AppendRequest *req, AppendResponse *res) override;
bool Delete(const DeleteRequest *req, DeleteResponse *res) override;
bool MultiGet(const MultiGetRequest *req, MultiGetResponse *res) override;
bool MultiPut(const MultiPutRequest *req, MultiPutResponse *res) override;
std::vector<std::string> AllKeys() override;
private:
DbMap store;
};
Notice that your concurrent key-value store will support the same operations as the SimpleKvStore
, so refer to Step 1 for descriptions of each method. Your internal store is defined as type DbMap
. We have already implemented the DbMap
struct for you. Your responsibility is to write the methods in kvstore/concurrent_kvstore.cpp
.
For this part, you may assume that only one thread will operate on your bucket-based KVStore at a time. We will add support for synchronized access in Step 4.
You are required to keep your internal store as type DbMap
, and to maintain the internal structure of the DbMap
(i.e., no changing this line):
std::array<std::list<DbItem>, BUCKET_COUNT> buckets;
However, you are not required to use getIfExists
, insertItem
, or removeItem
if you don't want to. We provide them because we think they will be helpful helper methods for you to work with your internal store, but if you'd rather write different ones (or do all of your work in kvstore/concurrent_kvstore.cpp
), feel free.
Task: Use the given DbMap
implementation to fill in the function bodies for Get
, Put
, Append
, Delete
, MultiGet
, MultiPut
, and AllKeys
in kvstore/concurrent_kvstore.cpp
.
When you are finished, you should pass all kvstore_sequential_tests
. You can check that your implementation is correct by running in the build/
directory:
make check A3
If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):
make test_name
./test_name
For Steps 3 and 4, you can either specify concurrent
as an argument or just leave it blank (as in the example above).
Step 4: Concurrent Bucket-Based KV Store
Overview
In this part, you'll modify your bucket-based key-value store implementation to support concurrent and parallel access using readers-writer locks.
Readers-writer locks
Recall from class the basic rule of synchronization:
If two or more threads can concurrently access an object, and at least one of the accesses is a write, a race condition can occur and synchronization is required.
Observe that synchronization is only required if at least one thread is writing to shared data. If both threads are reading, then they can safely read concurrently. Regular mutexes, however, do not address this distinction–only one thread can access a critical section at a time, so such concurrent reads are impossible.
One potential optimization is a readers-writer lock, or RWLock (std::shared_mutex
in C++). RWLocks allow for concurrent access for read-only operations, but exclusive access for write operations. Before entering a critical section, threads may attempt to acquire either a shared (or read) lock or an exclusive (or write) lock. Multiple threads may hold a shared lock at once, but like regular mutexes, only one thread may hold an exclusive lock. To ensure safe concurrent or exclusive access, reader threads cannot acquire a shared lock until there is no writer thread holding an exclusive lock.
As with mutexes, it is up to the programmer to properly protect shared data and lock critical sections within the program. RWLocks present an additional challenge over mutexes: programmers must also ensure that threads acquire the correct type of lock before entering a critical section or potentially modifying data. For example, acquiring a shared lock and then modifying shared data is undefined behavior. As you implement key-value store functions, keep in mind which functions are amenable to concurrent access, and which require exclusive access.
Example of RWLocks
To acquire exclusive access to a critical section, use the lock and unlock functions. To acquire shared access, use lock_shared and unlock_shared:
shared_mtx.lock_shared();
shared_mtx.unlock_shared();
shared_mtx.lock();
shared_mtx.unlock();
See the std::shared_mutex
class for complete documentation.
In addition, like regular mutexes, C++ provides a wrapper class, std::shared_lock
, that automatically unlocks the shared mutex once it goes out of scope. This has the same API as std::unique_lock
:
void foo() {
std::shared_lock guard(shared_mtx);
}
Optional: Why not just use RWLocks all the time?
Especially in applications that experience read-heavy workloads, RWLocks permit more concurrency over regular mutexes. However, RWLocks should not be used as a drop-in replacement for mutexes: RWLocks come with additional overhead over simple mutexes, as they must contain additional state and logic to handle readers and writers. RWLocks also must avoid reader or writer starvation, where one operation is disproportionately favored. Wikipedia's article provides an introduction to potential RWLock implementations and problems.
Task: Add RWLocks to DbMap
to synchronize access to your buckets.
You may not use a single, global mutex. If you find yourself writing code like this:
global_mtx.lock();
global_mtx.unlock();
you are doing something wrong. You will not pass the performance tests, and may face further deductions during grading. The purpose of this step is to support thread-safe, parallel access to the store–a single lock prevents parallelism!
Hints
Think about a more fine-grained approach that makes concurrent access to your store more efficient! See Part 3 for an explanation of coarse-grained vs. fine-grained locking. How can you use multiple RW locks such that multiple threads can operate on your store in parallel?
Synchronization Considerations
Now, it's time to use the primitives you defined to synchronize access to your key-value store. Here are two hints before you get started:
-
Enforce a consistent locking order. For example, if a given operation requires lock1
and lock2
, you should enforce that every operation always attempts to lock lock1
before lock2
(or vice versa). Think about why this strategy is necessary to prevent deadlock!
-
Acquire all relevant locks before accessing or modifying the store. We want each operation in our key-value store to be atomic, meaning that they are each executed as a single, indivisible unit. In this context of a key-value store, other operations should not modify relevant buckets while the MultiGet
, MultiPut
, and AllKeys
operations are running.
Consider two concurrent operations: one MultiGet that requires lock1
and lock2
, and one Delete that requires lock1
. Imagine the following execution order:
- The MultiGet acquires
lock1
, gets value1
associated with key1
, then releases lock1
.
- The Delete acquires
lock1
, deletes the key1: value1
entry, then releases lock1
.
- The MultiGet acquires
lock2
, gets some values, than releases lock2
.
When the MultiGet returns, it will returns a value (value1
) that is no longer in the store. We can prevent this bug by ensuring that the MultiGet owns both lock1
and lock2
before proceeding, and does not release either until it has finished entirely.
Task: Modify the methods in kvstore/concurrent_kvstore.cpp
to enforce synchronized access, using the primitives you defined in the previous task.
When you are finished, you should pass all kvstore_parallel_tests
. You can check that your implementation is correct by running in the build/
directory:
$ make check A4
If you implemented fine-grained locking correctly, you should also pass the performance tests. You can run the performance tests by running in the build/
directory:
$ make perf
Your multi-threaded solution with fine-grained locking needs to be at least 2x faster than (i.e., achieve twice the throughput of) the single-threaded, sequential implementation.
At this point, your implementation should pass all of the Part A tests! You can test this by running in the build/
directory:
$ ./run_tests concurrent_store
or
$ make check concurrent_store
Debugging with Sanitizers and GDB
Debugging concurrent code can be trickier than debugging single-threaded code. As threads interact with the shared data structures, it can be difficult to track down data races and deadlocks. Luckily, GDB and thread sanitizers can be really helpful.
Thread Sanitizers
Google’s ThreadSanitizer for C and C++ can help you detect data races, deadlocks, and thread leaks. To run the code with thread sanitizers, compile it with:
make TSAN=1
When compiling tests, you will need to specify TSAN=1
too:
make check TSAN=1
or
./test.sh TSAN=1 <A1|A2|A3|A4|A5>
Example
Consider the following example:
int count = 0;
int main() {
std::thread p1(child);
std::thread p2(child);
p1.join();
p2.join();
printf("count %d\n", count);
}
int child() {
while(count < 100) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
count += 1;
}
return 0;
}
When you run this code using a thread sanitizer, it will produce the following output:
Note: Open the image in a new tab for a larger image.
- The red message indicates what error is occurring (in this case, a data race)
- The blue messages indicate conflicting memory accesses with a stack trace. In our case, the conflict occurs on line 25 and line 27. It includes which thread is performing each access, and if the thread holds a mutex it will also indicate it.
- Below these messages, the sanitizer might also indicate:
-
- specific mutexes, when they were aquired, which threads acquired them, and sometimes when they were created
-
- threads, and when they were created
- Refer to this ThreadSanitizerReportFormat for more information on the report format.
GDB
Sometimes if your program hangs due to a deadlock, the thread sanitizer may not be able to return any reports. In these cases, gdb has some more helpful tools. Here are some useful gdb commands:
info threads
: prints out existing threads, the Light Weight Processor (LWP) they’re running on, and the line of code they’re executing. The current thread will have an asterisk next to it.
3 Thread 0x7ffff3bff700 (LWP 9167) "example" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
4 Thread 0x7ffff33fe700 (LWP 9003) "example" __lll_lock_wait ()
at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
thread <thread-id>
: switches to thread <thread-id>
(gdb) thread 4
[Switching to thread 4 (Thread 0x7ffff33fe700 (LWP 9003))]
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
p <mutex>
: Prints the fields of a mutex. Helpful for determining the owner of the lock identified by its Light Weight Processor.
CTRL+Z
: Stops the execution of the program. If your program is hanging likely due to a deadlock, this allows you to examine the current state of the program. For example, if one of your threads is blocked on a mutex, you can switch to that thread, see which thread owns the mutex, and then switch to that thread.
bt
: Prints out a stacktrace. Often times, threads are blocked on a system call (locking a mutex, waiting on a condition variable, or reading/writing to a file descriptor). Backtrace is useful for discovering where in the codebase the blocking call was called.
f <frame-number>
: Switches to the provided frame number. Allows you to switch to frames within files you’re familiar with and determine which function called blocking system calls.
Context
Now that you've implemented a simple key-value store, let's build upon it with a use case!
In this part of the assignment, you’ll be working with a specialized and more complicated key-value store for a new social media platform, Tweeter. On Tweeter, users can choose their usernames, write posts that appear on their profiles, and respond to other users’ posts (which appear on both users’ profiles).
In Tweeter’s database, there are five kinds of key-value pairs.
Key Value Pair Structure |
Example Return Value |
user_id → username |
“user_14” → “malte” |
post_id → post content |
“post_59” → “Hello, Tweeter!” |
all_users → comma-separated list of user_ids |
“all_users” → “user_13,user_14,user_160” |
user_id_posts → comma-separated list of post_ids that user has posted |
“user_14_posts” → “post_59,post_1” |
post_id_replies → comma-separated list of post_ids that respond to post |
“post_59_replies” → “post_60, post_61” |
Even though there can be multiple users with the same usernames, every user_id
is unique. The same goes for posts: even though there can be multiple posts with the same text, every post has a unique post_id
.
Loading your KVStore with data
We provide you with the data stored on Tweeter's instance of KVStore here. The same data is also in the gdpr/database.csv
, and you can load it into your KVStore as follows.
In one terminal, run this command from the build
directory to start a KVStore server listening on port 1234 (you can pick any number for the port, it just sets up a rendevous point for your client to connect with the server):
build$ ./server 1234
In another terminal, run the client and feed in the data:
build$ ./simple_client 127.0.0.1:1234 < ../gdpr/database.csv
After running this command, typing print store
into the first terminal (which runs the server) should show that your KVStore contains our dataset. Note that since the KVStore is an in-memory store, you will need to re-load the dataset every time you restart the server, as it loses its contents on shutdown.
Now, a client can connect to the server and make API requests; for instance, to fetch user_1
's data from the server, start up a new client and make a Get
request:
build$ ./simple_client 127.0.0.1:1234
get user_1
Regulatory Background
When implementing a real-world storage system handling personal data like Tweeter, it is essential to understand the legal frameworks that govern data privacy. In this section of the assignment, you will implement features pertaining to privacy regulations, with a particular focus on the European Union's General Data Protection Regulation (GDPR). This regulation is designed to ensure that users have agency over their data, including the right to both access and delete their information online.
Task: Please read the specific requirements of the GDPR at the following links.
What about the United States?
The US is moving slower in data protection laws than the EU, but some states have created GDPR-like regulations such as the California Consumer Privacy Act(CCPA).
You can read about its equivalent policies here:
In a nutshell… 🥜
Right to Access
GDPR Right of Access (Article 15) – Users can receive a copy of their own data upon request, as well as the source of any data that was not collected from the subject. When relevant, they can also request: the reason their data has been processed, the categories of data being processed, who can access this data, and the duration for which their data will be stored. They must also be informed when their data is transferred to a third party.
And the United States?
CCPA Right to Know – Users can request access to personal data from companies, as well as: the categories of their data, the categories of sources, the purpose of collecting that data, and information about what categories of data are being transferred to what types of third parties.
What’s the difference?
Not much – for our application they share broadly similar data access rights!
Right to Delete:
GDPR Right to be Forgotten (Article 17) – Users can request the deletion of the personal data under six specific circumstances, including when it is no longer needed for the purpose it was collected, or the user has revoked consent. Companies are not required to comply if the data is essential for: avoiding infringing upon freedom of expression, complying with legal requirements, engaging in scientific or historical research, or reasons in the interest of public health.
And the United States?
CCPA Right to Delete – Users can request the deletion of all personal data at any time. Companies are not required to comply if the data is essential for: fulfilling the purpose for which it was collected, ensuring security and integrity, identifying and repairing bugs in functionality, avoiding infringing upon the rights of another consumer (e.g. to free speech), engaging in scientific, historical, or statistical research, “enabling solely internal uses that are reasonably aligned with the expectations of the consumer,” or complying with legal requirements.
What’s the difference?
The primary difference is that “the GDPR right only applies if the request meets one of six specific conditions while the CCPA right is broad. However, the CCPA also allows business to refuse the request on much broader grounds than the GDPR” (Practical Law).
Stakeholders
Tweeters has users who are in the EU, so Tweeter must comply with the GDPR’s right to access and the right to be forgotten. You are in charge of making a decision on how to handle a particular user's request to exercise their right to access and their right to be forgotten.
Task: Check EdStem! You will find a pinned post containing information about which stakeholder pair (or pairs) you have been assigned to complete this portion of the assignment.
If you have been assigned two stakeholder pairs, please use the first one for all parts of the assignment. You will use the second pair to answer question 3B later!
You will write a function GDPRDelete()
that performs a delete request for your assigned stakeholder. There are several ways one might implement privacy-conscious deletion (think about what we discussed in section for ideas!). While many different kinds of delete can be acceptable for your stakeholder pair, you cannot opt out of implementing some form of delete and reject their request, although your implementation may not meet all of their hopes and expectations. (In the real world, there are cases where deletion requests have been rejected completely, but for the scope of this assignment, we’ve selected the dataset and the stakeholders such that you should be able to implement some kind of delete, even if it is an unsatisfactory one for your stakeholder.)
The caveat with this particular assignment is that you are also accountable to an ethical auditor who is charged with scrutinizing your company’s handling of GDPR compliance. Therefore, when satisfying your stakeholder’s request for deletion in this assignment, you must also consider other stakeholders that could be affected by this deletion and whose legitimate claims and interests might be infringed upon.
The following introduces each stakeholder pair and explains the context in which the deletion request occurs:
PAIR #1:
- Data Subject: Congressperson Kirby — Congressperson Kirby is currently the elected congressperson for Rhode Island’s first district and is running for reelection in the current election cycle. Recently, some of their tweets from when they were in college have resurfaced. These tweets were written 10 years ago concerning a pandemic that occurred at the time. While the congressperson has issued apologies, this has not been enough to stop the ongoing discourse from users all across the political spectrum. Because of this, the congressperson has made a request to exercise their right to be forgotten. They wish to delete their account, which they hope will delete their original tweets and related public discourse (other users quoting or rewording the original posts) about the controversy.
- Opposing stakeholder: Freedom House Advocacy Group — This advocacy group is concerned with government accountability & transparency. They oppose Kirby’s attempt to remove evidence of their controversy from the past.
PAIR #2:
- Data Subject: Sarsra Breisand — Sarsra Breisand is an American singer, actress and director. With a career spanning over six decades, she has achieved success in multiple fields of entertainment, and is among the few performers awarded an Emmy, Grammy, Oscar, and Tony (EGOT). Recently, a paparazzi reporter leaked information about where Sarsra Breisand lives. Despite her attempts to hide this information, she draws more attention to this leaked information. As a desperate final attempt to protect her privacy, Sarsra has made a request to exercise her right to be forgotten and delete her account and all posts related to her. Sarsra understands that this will erase her social media profile, but hopes that this will put an end to the interest in her whereabouts.
- Opposing stakeholder: Beth Abraham — Beth Abraham is a historian that has begun studying what she has named the Breisand Effect. She has released several research papers on this psychological and sociological phenomenon and is in the process of writing another.
PAIR #3:
- Data Subject: Angel Yoline — Angel Yoline is an up-and-coming actress who is excitedly promoting her new film on Tweeter. However, a while ago she posted views on Tweeter that characterize working class people in a negative light. Her former partner, Brad Schmidt, has recently higlighted these problematic views as a means of getting revenge on Angel. She now wishes for her own post, Brad’s post, and the public discourse about the controversy to be deleted.
- Opposing stakeholder: Film Enthusiasts — Film enthusiasts have been excited about the new film that Angel Yoline is starring in. When an account exposes something that Angel Yoline said about working class people, there is a surge of new posts discussing her controversial opinions.
PAIR #4:
- Data Subject: Frank Blimp — Frank Blimp is a recent divorcee who doesn’t have custody of his child from the marriage. Falling into hard times, Frank missed a couple of child support payments, which his ex-wife, Marge, has called him out on Tweeter. Even though both Frank and Marge have private accounts, these posts are visible to Frank’s friends and family. Since this tweet, Frank has been able to repay the missed payments and is financially stable enough to continue making future payments. He’s asked Marge to take down her tweets shaming him, but she has refused. Frank now requests his data to be erased in the hope that this will also make posts mentioning him disappear.
- Opposing stakeholder: Marge Blimp — Marge is Frank’s ex-wife who has main custody of their child. After he had missed multiple child support payments, she resorted to using Tweeter to call him out. She wants others to be aware of Frank's past behavior.
PAIR #5:
- Data Subject: Matt Bleat — Matt Bleat is a senior in high school who has just committed to Brown University after being recruited to their track and field team. Recently, there have been posts from Matt’s high school classmates calling attention to former allegations against Matt for sexual harassment. The high school determined that there was insufficient evidence to support those allegations and declined to take further action. Matt invokes the right to erasure, expecting that it will remove his posts, but also remove posts that mention hashtags related to the controversy.
- Opposing stakeholder: The Center for Changing Our Campus Culture — They believe that posts highlighting the allegations against Matt should remain on Tweeter, and Matt should be prepared to deal with the possibility that Brown could discover these tweets and rescind their offer of admission.
The GDPRDelete()
Function
The signature of GDPRDelete
is as follows:
bool GDPRDelete(std::string& user_id);
In other words, the function receives a single argument, which is the user ID of the data subject who is invoking the right to erasure (i.e., your data subject stakeholder). The function can use this argument to look up data related to the data subject in the KVStore, or to find the user's identifier (e.g., "user_1") in other data.
In implementing GDPRDelete()
you can consider a number of broad options for data store modification including, but not limited to:
- Complete removal of data.
- Modification of data (e.g. anonymization, modification, removal of parts of the content).
- Making data harder to find, i.e. delisting or downranking (what might this mean in the context of this KVStore?).
- Adding additional keys to the store.
Your GDPRDelete()
is allowed to leave the KVStore in an inconsistent state (e.g., the key "user_1" is deleted from the store, but remains in the "all_users" list), but if this is the case, this should be clearly noted and justified.
Note that the function does not receive information about the context in which the deletion happens (e.g., what other users' data the data subject might want to delete, or what the other users' views on this are). Your design could extend the KVStore with auxiliary metadata that captures relevant context (e.g., special key-value pairs that indicate users who are of special interest, such as public figures), and GDPRDelete()
may draw on this data. Using such metadata is not a requirement, however.
We will not grade you on the level of sophistication your GDPRDelete()
function achieves, but rather on whether it works. As long as your written answers justify your choices, you will get full credit, even if the function itself is simple.
Before you write any code…
You might feel overwhelmed with the situation you’ve been presented with — we’ve engineered each situation such that the conflict is intentionally difficult to deal with. Because of this, we want you to consider the following questions before you touch any of the code. You don’t have to submit your answers for this, but we highly recommend writing down some notes for yourself.
- Consider why the data subject would want their data (and potentially the others' data) to be deleted. What kind of ethically significant claims would support or reject their request?
- Consider why the opposing stakeholder wouldn't want data to be deleted. What kind of ethically significant claims would support or reject their request?
- What strategies or mechanisms could you propose to mitigate potential harm to the opposing stakeholder while still respecting the data subject's right to be forgotten?
Important note: we don’t expect you to find a solution that satisfies everyone — rather, the point is to make the most reasonable tradeoffs between the opposing parties’ claims.
Implementation Tasks
Task 1: Implement GDPRDelete()
in client/simple_client.cpp
.
Task 2: Leave detailed (header and/or inline) comments in your code that explain what kind of delete you are implementing.
GDPRDelete()
should operate on any keys (such as user_ID
, post_ID
, etc) you deem necessary to implement your chosen strategy of deletion.
To test your GDPRDelete()
function, you can use the gdprdelete <user>
command in the client as follows:
$ ./simple_client 127.0.0.1:1234
gdprdelete user_10
You can then use print store
in the server to see how your store contents have changed.
(Go back to Loading your KVStore with data if you need a reminder on how to start up your KVStore.)
After you’ve finished implementing delete…
The right to know and the right to delete affect all stakeholders of a database: its users, its operators, and those who use the data for technology, studies, and historical records. Now that you have implemented your version of privacy-conscious access and deletion, you should consider the potential pitfalls of your design choices in practice.
Task: Answer the following questions in your README file:
You may want to read the "How will you be graded" section below before beginning this task to understand what we are looking for!
Q1. Who was your stakeholder pair? (1 sentence)
Q2. What kind of delete did you implement and why? Explain your decisions as if you were reporting to an ethical auditor who will determine whether your design is justified. Highlight how your decisions address stakeholder concerns, and explain why you addressed their concerns in this way. If you purposely ignored any stakeholder concerns in your implementation, explain why they were not addressed (1-2 short paragraphs)
Q3. In the real world, Twitter, Google, Airbnb, and other companies write one implementation of GDPR-compliant account removal to realize the right to erasure. They do not write per-user functions! In the next part of the assignment, you'll think about how to generalize your implementation.
Answer the applicable question based on whether we assigned you one or two stakeholder pairs. Please mark which version of the question you are answering in your README. (1-3 short paragraphs)
-
Q3A. If we assigned you only one stakeholder pair: What are the shortcomings of your implementation? How might your approach to this assignment change if you were asked to create a general GDPRDelete()
suitable for Right to be Forgotten requests from all of Tweeter's users? Consider specifically some other short and/or long term stakeholders within your scenario for whom your implementation might not be suitable.
-
Q3B. If you were assigned two stakeholder pairs: What are the shortcomings of your implementation? How might your approach to this assignment change if you were asked to create a general GDPRDelete()
suitable for Right to be Forgotten requests from all of Tweeter's users? Consider specifically the concerns of your second pair of stakeholders, and whether your implementation of GDPRDelete()
would be suitable for these stakeholders as well.
Along with considering additional stakeholders, you can also consider other outside entities who might be affected by your implementation of GDPRDelete()
.
How will you be graded?
- Your implementation needs to work and do what you indicate it is doing in your comments.
- You’re not being graded on whether you picked the “right” answer — there is no one correct version of delete for each stakeholder pair. Rather, the bulk of your grade is determined by how well you explain why you chose your specific implementation: the justifications you offer in support of your implementation should reflect a comprehensive assessment of the context and competing claims and weigh them against each other in a nuanced way.
A good response identifies the legitimate claims that each stakeholder may have, explains why those claims are important, and compares the importance of both claims. It provides concrete reasons for (fully or partially) prioritizing or rejecting individual stakeholders’ claims and how those trade-offs are reflected in the chosen implementation of the right to be forgotten. A good response, importantly, also touches upon the limitations of those choices.
Your response should show that you have thought about the interests of your stakeholders beyond the directly legal implications. That is, you must use some justification outside of the GDPR itself for the design of your implementation. Because GDPR requirements are defined rather loosely, you will have naturally made decisions beyond those required legally; a good response will highlight some of these decisions. You may also think about some of the values that motivate the GDPR requirements that you have drawn upon.
Handing In and Grading
Handin Instructions
As before, you will hand in your code using Git. In the kvstore/
subdirectory of your projects repository, you MUST fill in the text file called README-5A.md
.
By 6:00pm on Friday, April 26th, you must have a commit with all steps completed.
Remind me again what the README should contain?
The README.md file will include the following:
- Any design decisions you made and comments for graders, under “Design Overview”. If there’s nothing interesting to say, just list “None”.
- Any collaborators and citations for help that you received, under "Collaborators". CS 300 encourages collaboration, but we expect you to acknowledge help you received, as is standard academic practice.
- Your answers to the questions regarding your
GDPRDelete()
implementation.
- Ranking the difficulty and time consumption of the project, compared to other projects so far. We won't use this for grading, but we collect the data to calibrate for future years.
Grading Breakdown
This project is graded out of 140 points, but will be weighted equally with the other projects in the course.
For the KVStore implementation parts:
- 18% (18 points) for passing the A1 tests (2 pts per test).
- 7% (7 points) for passing the A2 tests (1 pt per test).
- 40% (40 points) for passing the A3 tests (4 pts per test, and 4 pts for passing them all).
- 21% (21 points) for passing the A4 tests (3 pts per test).
- 14% (14 points) for passing the A5 (performance) tests (7 pts per test).
For the Tweeter part of the project:
- 25% (10 points) for your implementation of
GDPRDelete
.
- 25% (10 points) for comments and explanation of how your deletion function works.
- 50% (20 points) for answers to the four README questions.
Important note on using late hours:
You will be doing Project 5B in the same directory as Project 5A. We will set everyone's grading commit at the normal project deadline so that your Project 5B work won't incur late hours for your Project 5A submission.
However, if you are using late hours, please flag the commit you want graded so that we grade the right commit and assign the right number of late hours to Project 5A.
Now head to the grading server, make sure that you have the “KVStore” page configured correctly with your project repository, and check that your KVStore runs on the grading server as expected.
Congratulations, you’ve completed the first part of the fifth CS 300 project!
Acknowledgements: This project was developed for CS 300 by Richard Tang, Carolyn Zech, Cedric Sirianni, and Malte Schwarzkopf. The Tweeter portion of this project was developed for CS 300 by Eva Schiller, Eva Lau, Colton Rusch, and Malte Schwarzkopf.
« Back to the main CS 0300 website
Project 5A: Concurrent Store 🗂🗃🗄
Due: Friday, April 26 at 6:00 PM EST
Introduction
A key-value store is a storage system in which data (a value) is addressed by a key; common examples are Python's dictionaries or Java/C++ hash maps. Key-value stores are popular for large-scale web applications because they are fast and scalable.
Clients interface with the key-value store through a server. Once client(s) connect to the server, they can issue requests to get, put, append, and delete key-value pairs from the store.
In this project, you'll implement two kinds of key-value stores. You'll then extend your implementation to support concurrent client requests.
Production applications often handle so much data (and so many client connections) that one key-value store is not enough – they need to divide (i.e., shard) their data between multiple stores. In Project 5B, you'll extend your implementation to support such a distributed system.
Learning Objectives
This assignment will help you:
Assignment Installation
Ensure that your project repository has a handout remote. Type:
If this reports an error, run:
Then run:
This will merge our Project 5 stencil code with your repository.
Once you have a local working copy of the repository that is up to date with our stencils, you are good to proceed.
Infrastructure Help
Stencil and Initial State
Project 5B also uses this stencil, so there is more code in the repository than you will need for this assignment. These are the directories you'll need for this assigment:
build/
tests/
andtest_utils/
client/
simple_client
implementation you'll run in this assignment, along with the commands it supports. Also has the stencil forshardkv_client
, which you'll implement in Project 5B.gdpr/
database.csv
file you'll work with in the Tweeter portion of the assignmentkvstore/
KvStore
interface, as well as theSimpleKvStore
andConcurrentKvStore
classes.server/
KvServer
class and support code.If you're curious, descriptions of the rest of the directories are below:
cmd/
net/
repl/
common/
Shard
class and some helpers for managing or operating on them. Also contains miscellaneous utility functions, such as color printing. These will both be useful in Project 5B.shardcontroller/
shardcontroller
class, and support code to process client requests. You'll implement the shardcontroller methods in Project 5B.Compile and Run
To compile,
cd
into thebuild/
directory and runmake
. This will create two executables:simple_client
andserver
.To run the server, specify the port, and optionally the number of workers:
To run the simple client, specify the server address:
Example Usage
To connect multiple clients to a server, start by opening multiple terminals (or use
tmux
).Terminal 1: runs
server
on port 5000 with the default number of workers:Your server will then be listening on
<hostname>:5000
; the server's full address will be displayed in the terminal.Terminal 2: runs
simple_client
to connect to the server, using the address displayed above (in this example, the hostname is65a4af2c28ae
):The server defaults to using your SimpleKvStore. To change that, edit the line
this->store = std::make_unique<SimpleKvStore>();
inserver.cpp
to use aConcurrentKvStore
.Using the Client
Once connected to the server, you can use one or more clients to run standard key-value store operations, like
get
,put
,append
, anddelete
. You can also use two additional requests,multiget
andmultiput
(for multiple key-value pairs). For specific usage instructions, typehelp
in the command prompt.To exit the client (and any other executable), press
Ctrl-D
(orCmd-D
); this sendsEnd-Of-File
(EOF
) to the program, causing it to exit cleanly. If your program hangs, useCtrl-C
(orCmd-C
) to force-quit the program.Testing
We strongly recommend you test your code often while implementing, by compiling and running a server with a few clients (as shown above).
To run tests, you must be in the
build/
directory. Then, runmake check
.To run the tests for an individual step
#
, you can runmake check A#
. For example, to run the tests for step 1, you can runmake check A1
.The tests are in the
tests/
directory, with each subdirectory corresponding to a part of the assignment. After implementing each step, the testing harness will run all tests related to that step. Specifically, the steps will run all tests in the following directories:kvstore_sequential_tests
(A1)kvstore_parallel_tests
(A2)kvstore_sequential_tests
(A3)kvstore_parallel_tests
(A4) andkvstore_performance_tests
(A5)If you're failing tests, your first step to debug should be to run them individually. Each test prints why it failed, but
make check
suppresses that output.To run an individual test, run that specific test's executable. For instance, for
test_put_get_delete
, run in thebuild/
directory:If nothing prints, the test passed. Otherwise, you'll see an error message.
Example Error Message
The error message directs us to this assertion:
so the error message is telling us that
get_res.value
(the first argument) is empty, but the test expected it to be equal to the second argument,vals[i]
, or x1Ha4M0dLCkK.The correctness tests can also optionally specify which type of KVStore to run (
SimpleKvStore
orConcurrentKvStore
) by passing in eithersimple
orconcurrent
as an argument:The tests default to the
ConcurrentKvStore
.If you want to write your own tests, add new files to the subdirectories in
tests/
. Be sure to take a look at the functions intest_utils/test_utils.hpp
–- they will make writing your own tests much easier.Assignment
Step 1: Single-Threaded Simple KV Store
In this section, you'll implement a single-threaded simple key-value store. By "single-threaded," we mean that only one thread can safely operate on the KVStore at a time. This means that for Step 1, you should not worry about concurrency. You may safely assume that only one thread will ever be operating on your KVStore at a time.
Take a look at how we define a
SimpleKvStore
(inkvstore/simple_kvstore.hpp
):Task: Define your internal key-value store. What data structure might be useful here? You may find this reference page useful.
Now it's time to implement each of these methods. First, look at how we define the
Request
andResponse
structs innet/server_commands.hpp
. TheRequest
object contains the information being requested. TheResponse
object is where your methods should store the response to the request (if a response is needed). For theAllKeys
operation, return a vector of the keys. For the other operations, returntrue
if the operation is successful; otherwise, returnfalse
.Give me an example!
A call to
Get(const GetRequest *req, GetResponse *res)
takes in a pointer to aGetRequest
, which has akey
field. Your method should get the value corresponding tokey
from your internal key-value store and store it in thevalue
field of theGetResponse
thatres
points to.If a
Response
struct is empty, that means you don't need to provide an explicit confirmation that you fulfilled the request. For example, for a call toPut(const PutRequest *req, PutResponse *res)
, you should insert thekey
andvalue
from the PutRequest thatreq
points to into your internal key-value store. SincePutResponse
is an empty struct, you don't need to do anything with it.Your key-value store will support the following operations:
Get
<key>
<key>
from the key-value store. Fails if it does not exist.Put
<key>
,<value>
<key>
to<value>
; replaces existing values.Append
<key>
,<value>
<value>
to the current value associated with<key>
. Creates a new key-value pair if it does not already exist.Delete
<key>
<key>
from the key-value store, returning the last saved value. Fails if it does not exist.MultiGet
<key_1> ... <key_n>
<key_1> ... <key_n>
from the key-value store. Fails if any key does not exist.MultiPut
<key_1> ... <key_n>, <value_1> ... <value_n>
<key_1> ... <key_n>
to<value_1> ... <value_n>
, respectively.AllKeys
Task: Implement the methods in
kvstore/simple_kvstore.cpp
.When you are finished, you should pass all
kvstore_sequential_tests
. You can check that your implementation is correct by running in thebuild/
directory:If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):
The tests default to using the
ConcurrentKvStore
(which you'll implement in Steps 3 and 4), so you'll need to explicitly tell each test to use theSimpleKvStore
by passing insimple
as an argument.Step 2: Concurrent Simple KV Store
We would like our server to be able to handle multiple requests in parallel. However, our simple key-value store can't safely handle concurrent requests yet! In this step, you'll make your simple key-value store thread-safe.
Implementation
Task: Add a synchronization primitive to
SimpleKvStore
inkvstore/simple_kvstore.hpp
to make it thread-safe. You should use at most one mutex in this part. (We will explore more efficient locking schemes in later steps.)Task: Modify the methods in
kvstore/simple_kvstore.cpp
to enforce synchronized access, using the primitive you defined in the previous task.When you are finished, you should pass all
kvstore_parallel_tests
. You can check that your implementation is correct by running in thebuild/
directory:If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):
Step 3: Single-Threaded Bucket-Based KV Store
Overview
Operations on your simple key-value store are thread-safe and concurrent, but there is no actual parallelism. Since two threads cannot modify the store at the same time, it is thread-safe. However, the entire map gets locked every time a thread wants to perform an operation on the store, which means that there are no parallel (i.e. simultaneous) operations.
This single mutex protection strategy is an example of coarse-grained locking: one lock protects access to a large section of code or an entire data structure. Fine-grained locking attempts to reduce lock contention by splitting the data structure or code into smaller, potentially disjoint pieces that each have their own lock. Multiple threads can then operate on separate pieces in parallel.
However, we have to be careful that a fine-grained locking scheme doesn't introduce subtle concurrency bugs. Also, having too many mutexes can introduce overhead that offsets the benefits of the fine-grained locking strategy.
Example of Coarse vs. Fine-Grained locking
A data structure you could implement coarse or fine-grained locking would be an array.
Coarse-grained locking would imply locking the entire array when any of it needs to be modified. This means that even if only a small part of the array will be modified, no other threads can access any part of the array.
This would look something like having an array-specific mutex.
Alternatively, you could have fine-grained locking. This would involve ensuring that, instead of the entire array being unavailable, only the specific components changed will be unavailable to other threads. Thus, every "slot" in the array would need a corresponding mutex. While this allows for greater flexibility for threads, this structure may be a lot heavier on the system (because one would need to create more mutexes).
This would look something like:
Notice in the above example, the scalability isn't too great either – this is because we chose a too-fine grain 🌾 Thus, it's always important to find a balance between approaches.
Another very cool example of fine and coarse-grained locking is locking a linked list! Feel free to examine this link for more information.
Implementation
If the single mutex locking strategy from Step 2 is too coarse-grained, and one lock for each key-value pair is too fine-grained, how can we design an efficient and scalable KVStore? We need to revisit our internal store design to support a new locking scheme–one where there's more than one lock, but fewer locks than the number of keys.
For this part, you will implement the key-value store API as a bucket-based hash table.
Take a look at how we define a
ConcurrentKvStore
(inkvstore/concurrent_kvstore.hpp
):Notice that your concurrent key-value store will support the same operations as the
SimpleKvStore
, so refer to Step 1 for descriptions of each method. Your internal store is defined as typeDbMap
. We have already implemented theDbMap
struct for you. Your responsibility is to write the methods inkvstore/concurrent_kvstore.cpp
.For this part, you may assume that only one thread will operate on your bucket-based KVStore at a time. We will add support for synchronized access in Step 4.
You are required to keep your internal store as type
DbMap
, and to maintain the internal structure of theDbMap
(i.e., no changing this line):However, you are not required to use
getIfExists
,insertItem
, orremoveItem
if you don't want to. We provide them because we think they will be helpful helper methods for you to work with your internal store, but if you'd rather write different ones (or do all of your work inkvstore/concurrent_kvstore.cpp
), feel free.Task: Use the given
DbMap
implementation to fill in the function bodies forGet
,Put
,Append
,Delete
,MultiGet
,MultiPut
, andAllKeys
inkvstore/concurrent_kvstore.cpp
.When you are finished, you should pass all
kvstore_sequential_tests
. You can check that your implementation is correct by running in thebuild/
directory:If you're failing a test, you should run it individually to see debug output (replace test_name with the name of the test):
For Steps 3 and 4, you can either specify
concurrent
as an argument or just leave it blank (as in the example above).Step 4: Concurrent Bucket-Based KV Store
Overview
In this part, you'll modify your bucket-based key-value store implementation to support concurrent and parallel access using readers-writer locks.
Readers-writer locks
Recall from class the basic rule of synchronization:
Observe that synchronization is only required if at least one thread is writing to shared data. If both threads are reading, then they can safely read concurrently. Regular mutexes, however, do not address this distinction–only one thread can access a critical section at a time, so such concurrent reads are impossible.
One potential optimization is a readers-writer lock, or RWLock (
std::shared_mutex
in C++). RWLocks allow for concurrent access for read-only operations, but exclusive access for write operations. Before entering a critical section, threads may attempt to acquire either a shared (or read) lock or an exclusive (or write) lock. Multiple threads may hold a shared lock at once, but like regular mutexes, only one thread may hold an exclusive lock. To ensure safe concurrent or exclusive access, reader threads cannot acquire a shared lock until there is no writer thread holding an exclusive lock.As with mutexes, it is up to the programmer to properly protect shared data and lock critical sections within the program. RWLocks present an additional challenge over mutexes: programmers must also ensure that threads acquire the correct type of lock before entering a critical section or potentially modifying data. For example, acquiring a shared lock and then modifying shared data is undefined behavior. As you implement key-value store functions, keep in mind which functions are amenable to concurrent access, and which require exclusive access.
Example of RWLocks
To acquire exclusive access to a critical section, use the lock and unlock functions. To acquire shared access, use lock_shared and unlock_shared:
See the
std::shared_mutex
class for complete documentation.In addition, like regular mutexes, C++ provides a wrapper class,
std::shared_lock
, that automatically unlocks the shared mutex once it goes out of scope. This has the same API asstd::unique_lock
:Optional: Why not just use RWLocks all the time?
Especially in applications that experience read-heavy workloads, RWLocks permit more concurrency over regular mutexes. However, RWLocks should not be used as a drop-in replacement for mutexes: RWLocks come with additional overhead over simple mutexes, as they must contain additional state and logic to handle readers and writers. RWLocks also must avoid reader or writer starvation, where one operation is disproportionately favored. Wikipedia's article provides an introduction to potential RWLock implementations and problems.
Task: Add RWLocks to
DbMap
to synchronize access to your buckets.You may not use a single, global mutex. If you find yourself writing code like this:
you are doing something wrong. You will not pass the performance tests, and may face further deductions during grading. The purpose of this step is to support thread-safe, parallel access to the store–a single lock prevents parallelism!
Hints
Think about a more fine-grained approach that makes concurrent access to your store more efficient! See Part 3 for an explanation of coarse-grained vs. fine-grained locking. How can you use multiple RW locks such that multiple threads can operate on your store in parallel?
Synchronization Considerations
Now, it's time to use the primitives you defined to synchronize access to your key-value store. Here are two hints before you get started:
Enforce a consistent locking order. For example, if a given operation requires
lock1
andlock2
, you should enforce that every operation always attempts to locklock1
beforelock2
(or vice versa). Think about why this strategy is necessary to prevent deadlock!Acquire all relevant locks before accessing or modifying the store. We want each operation in our key-value store to be atomic, meaning that they are each executed as a single, indivisible unit. In this context of a key-value store, other operations should not modify relevant buckets while the
MultiGet
,MultiPut
, andAllKeys
operations are running.Consider two concurrent operations: one MultiGet that requires
lock1
andlock2
, and one Delete that requireslock1
. Imagine the following execution order:lock1
, getsvalue1
associated withkey1
, then releaseslock1
.lock1
, deletes thekey1: value1
entry, then releaseslock1
.lock2
, gets some values, than releaseslock2
.When the MultiGet returns, it will returns a value (
value1
) that is no longer in the store. We can prevent this bug by ensuring that the MultiGet owns bothlock1
andlock2
before proceeding, and does not release either until it has finished entirely.Task: Modify the methods in
kvstore/concurrent_kvstore.cpp
to enforce synchronized access, using the primitives you defined in the previous task.When you are finished, you should pass all
kvstore_parallel_tests
. You can check that your implementation is correct by running in thebuild/
directory:If you implemented fine-grained locking correctly, you should also pass the performance tests. You can run the performance tests by running in the
build/
directory:Your multi-threaded solution with fine-grained locking needs to be at least 2x faster than (i.e., achieve twice the throughput of) the single-threaded, sequential implementation.
At this point, your implementation should pass all of the Part A tests! You can test this by running in the
build/
directory:or
Debugging with Sanitizers and GDB
Debugging concurrent code can be trickier than debugging single-threaded code. As threads interact with the shared data structures, it can be difficult to track down data races and deadlocks. Luckily, GDB and thread sanitizers can be really helpful.
Thread Sanitizers
Google’s ThreadSanitizer for C and C++ can help you detect data races, deadlocks, and thread leaks. To run the code with thread sanitizers, compile it with:
When compiling tests, you will need to specify
TSAN=1
too:or
Example
Consider the following example:When you run this code using a thread sanitizer, it will produce the following output:
Note: Open the image in a new tab for a larger image.
GDB
Sometimes if your program hangs due to a deadlock, the thread sanitizer may not be able to return any reports. In these cases, gdb has some more helpful tools. Here are some useful gdb commands:
info threads
: prints out existing threads, the Light Weight Processor (LWP) they’re running on, and the line of code they’re executing. The current thread will have an asterisk next to it.thread <thread-id>
: switches to thread <thread-id>p <mutex>
: Prints the fields of a mutex. Helpful for determining the owner of the lock identified by its Light Weight Processor.CTRL+Z
: Stops the execution of the program. If your program is hanging likely due to a deadlock, this allows you to examine the current state of the program. For example, if one of your threads is blocked on a mutex, you can switch to that thread, see which thread owns the mutex, and then switch to that thread.bt
: Prints out a stacktrace. Often times, threads are blocked on a system call (locking a mutex, waiting on a condition variable, or reading/writing to a file descriptor). Backtrace is useful for discovering where in the codebase the blocking call was called.f <frame-number>
: Switches to the provided frame number. Allows you to switch to frames within files you’re familiar with and determine which function called blocking system calls.Tweeter
Context
Now that you've implemented a simple key-value store, let's build upon it with a use case!
In this part of the assignment, you’ll be working with a specialized and more complicated key-value store for a new social media platform, Tweeter. On Tweeter, users can choose their usernames, write posts that appear on their profiles, and respond to other users’ posts (which appear on both users’ profiles).
In Tweeter’s database, there are five kinds of key-value pairs.
user_id
→ usernamepost_id
→ post contentall_users
→ comma-separated list ofuser_ids
user_id_posts
→ comma-separated list ofpost_ids
that user has postedpost_id_replies
→ comma-separated list ofpost_ids
that respond to postEven though there can be multiple users with the same usernames, every
user_id
is unique. The same goes for posts: even though there can be multiple posts with the same text, every post has a uniquepost_id
.Loading your KVStore with data
We provide you with the data stored on Tweeter's instance of KVStore here. The same data is also in the
gdpr/database.csv
, and you can load it into your KVStore as follows.In one terminal, run this command from the
build
directory to start a KVStore server listening on port 1234 (you can pick any number for the port, it just sets up a rendevous point for your client to connect with the server):build$ ./server 1234
In another terminal, run the client and feed in the data:
build$ ./simple_client 127.0.0.1:1234 < ../gdpr/database.csv
After running this command, typing
print store
into the first terminal (which runs the server) should show that your KVStore contains our dataset. Note that since the KVStore is an in-memory store, you will need to re-load the dataset every time you restart the server, as it loses its contents on shutdown.Now, a client can connect to the server and make API requests; for instance, to fetch
user_1
's data from the server, start up a new client and make aGet
request:build$ ./simple_client 127.0.0.1:1234 get user_1
Regulatory Background
When implementing a real-world storage system handling personal data like Tweeter, it is essential to understand the legal frameworks that govern data privacy. In this section of the assignment, you will implement features pertaining to privacy regulations, with a particular focus on the European Union's General Data Protection Regulation (GDPR). This regulation is designed to ensure that users have agency over their data, including the right to both access and delete their information online.
Task: Please read the specific requirements of the GDPR at the following links.
What about the United States?
The US is moving slower in data protection laws than the EU, but some states have created GDPR-like regulations such as the California Consumer Privacy Act(CCPA).
You can read about its equivalent policies here:
In a nutshell… 🥜
Right to Access
GDPR Right of Access (Article 15) – Users can receive a copy of their own data upon request, as well as the source of any data that was not collected from the subject. When relevant, they can also request: the reason their data has been processed, the categories of data being processed, who can access this data, and the duration for which their data will be stored. They must also be informed when their data is transferred to a third party.
And the United States?
CCPA Right to Know – Users can request access to personal data from companies, as well as: the categories of their data, the categories of sources, the purpose of collecting that data, and information about what categories of data are being transferred to what types of third parties.
What’s the difference?
Not much – for our application they share broadly similar data access rights!
Right to Delete:
GDPR Right to be Forgotten (Article 17) – Users can request the deletion of the personal data under six specific circumstances, including when it is no longer needed for the purpose it was collected, or the user has revoked consent. Companies are not required to comply if the data is essential for: avoiding infringing upon freedom of expression, complying with legal requirements, engaging in scientific or historical research, or reasons in the interest of public health.
And the United States?
CCPA Right to Delete – Users can request the deletion of all personal data at any time. Companies are not required to comply if the data is essential for: fulfilling the purpose for which it was collected, ensuring security and integrity, identifying and repairing bugs in functionality, avoiding infringing upon the rights of another consumer (e.g. to free speech), engaging in scientific, historical, or statistical research, “enabling solely internal uses that are reasonably aligned with the expectations of the consumer,” or complying with legal requirements.
What’s the difference?
The primary difference is that “the GDPR right only applies if the request meets one of six specific conditions while the CCPA right is broad. However, the CCPA also allows business to refuse the request on much broader grounds than the GDPR” (Practical Law).
Stakeholders
Tweeters has users who are in the EU, so Tweeter must comply with the GDPR’s right to access and the right to be forgotten. You are in charge of making a decision on how to handle a particular user's request to exercise their right to access and their right to be forgotten.
Task: Check EdStem! You will find a pinned post containing information about which stakeholder pair (or pairs) you have been assigned to complete this portion of the assignment.
If you have been assigned two stakeholder pairs, please use the first one for all parts of the assignment. You will use the second pair to answer question 3B later!
You will write a function
GDPRDelete()
that performs a delete request for your assigned stakeholder. There are several ways one might implement privacy-conscious deletion (think about what we discussed in section for ideas!). While many different kinds of delete can be acceptable for your stakeholder pair, you cannot opt out of implementing some form of delete and reject their request, although your implementation may not meet all of their hopes and expectations. (In the real world, there are cases where deletion requests have been rejected completely, but for the scope of this assignment, we’ve selected the dataset and the stakeholders such that you should be able to implement some kind of delete, even if it is an unsatisfactory one for your stakeholder.)The caveat with this particular assignment is that you are also accountable to an ethical auditor who is charged with scrutinizing your company’s handling of GDPR compliance. Therefore, when satisfying your stakeholder’s request for deletion in this assignment, you must also consider other stakeholders that could be affected by this deletion and whose legitimate claims and interests might be infringed upon.
The following introduces each stakeholder pair and explains the context in which the deletion request occurs:
PAIR #1:
PAIR #2:
PAIR #3:
PAIR #4:
PAIR #5:
The
GDPRDelete()
FunctionThe signature of
GDPRDelete
is as follows:bool GDPRDelete(std::string& user_id);
In other words, the function receives a single argument, which is the user ID of the data subject who is invoking the right to erasure (i.e., your data subject stakeholder). The function can use this argument to look up data related to the data subject in the KVStore, or to find the user's identifier (e.g., "user_1") in other data.
In implementing
GDPRDelete()
you can consider a number of broad options for data store modification including, but not limited to:Your
GDPRDelete()
is allowed to leave the KVStore in an inconsistent state (e.g., the key "user_1" is deleted from the store, but remains in the "all_users" list), but if this is the case, this should be clearly noted and justified.Note that the function does not receive information about the context in which the deletion happens (e.g., what other users' data the data subject might want to delete, or what the other users' views on this are). Your design could extend the KVStore with auxiliary metadata that captures relevant context (e.g., special key-value pairs that indicate users who are of special interest, such as public figures), and
GDPRDelete()
may draw on this data. Using such metadata is not a requirement, however.We will not grade you on the level of sophistication your
GDPRDelete()
function achieves, but rather on whether it works. As long as your written answers justify your choices, you will get full credit, even if the function itself is simple.Before you write any code…
You might feel overwhelmed with the situation you’ve been presented with — we’ve engineered each situation such that the conflict is intentionally difficult to deal with. Because of this, we want you to consider the following questions before you touch any of the code. You don’t have to submit your answers for this, but we highly recommend writing down some notes for yourself.
Important note: we don’t expect you to find a solution that satisfies everyone — rather, the point is to make the most reasonable tradeoffs between the opposing parties’ claims.
Implementation Tasks
Task 1: Implement
GDPRDelete()
inclient/simple_client.cpp
.Task 2: Leave detailed (header and/or inline) comments in your code that explain what kind of delete you are implementing.
GDPRDelete()
should operate on any keys (such asuser_ID
,post_ID
, etc) you deem necessary to implement your chosen strategy of deletion.To test your
GDPRDelete()
function, you can use thegdprdelete <user>
command in the client as follows:You can then use
print store
in the server to see how your store contents have changed.(Go back to Loading your KVStore with data if you need a reminder on how to start up your KVStore.)
After you’ve finished implementing delete…
The right to know and the right to delete affect all stakeholders of a database: its users, its operators, and those who use the data for technology, studies, and historical records. Now that you have implemented your version of privacy-conscious access and deletion, you should consider the potential pitfalls of your design choices in practice.
Task: Answer the following questions in your README file:
You may want to read the "How will you be graded" section below before beginning this task to understand what we are looking for!
Q1. Who was your stakeholder pair? (1 sentence)
Q2. What kind of delete did you implement and why? Explain your decisions as if you were reporting to an ethical auditor who will determine whether your design is justified. Highlight how your decisions address stakeholder concerns, and explain why you addressed their concerns in this way. If you purposely ignored any stakeholder concerns in your implementation, explain why they were not addressed (1-2 short paragraphs)
Q3. In the real world, Twitter, Google, Airbnb, and other companies write one implementation of GDPR-compliant account removal to realize the right to erasure. They do not write per-user functions! In the next part of the assignment, you'll think about how to generalize your implementation.
Answer the applicable question based on whether we assigned you one or two stakeholder pairs. Please mark which version of the question you are answering in your README. (1-3 short paragraphs)
Q3A. If we assigned you only one stakeholder pair: What are the shortcomings of your implementation? How might your approach to this assignment change if you were asked to create a general
GDPRDelete()
suitable for Right to be Forgotten requests from all of Tweeter's users? Consider specifically some other short and/or long term stakeholders within your scenario for whom your implementation might not be suitable.Q3B. If you were assigned two stakeholder pairs: What are the shortcomings of your implementation? How might your approach to this assignment change if you were asked to create a general
GDPRDelete()
suitable for Right to be Forgotten requests from all of Tweeter's users? Consider specifically the concerns of your second pair of stakeholders, and whether your implementation ofGDPRDelete()
would be suitable for these stakeholders as well.Along with considering additional stakeholders, you can also consider other outside entities who might be affected by your implementation of
GDPRDelete()
.How will you be graded?
A good response identifies the legitimate claims that each stakeholder may have, explains why those claims are important, and compares the importance of both claims. It provides concrete reasons for (fully or partially) prioritizing or rejecting individual stakeholders’ claims and how those trade-offs are reflected in the chosen implementation of the right to be forgotten. A good response, importantly, also touches upon the limitations of those choices.
Your response should show that you have thought about the interests of your stakeholders beyond the directly legal implications. That is, you must use some justification outside of the GDPR itself for the design of your implementation. Because GDPR requirements are defined rather loosely, you will have naturally made decisions beyond those required legally; a good response will highlight some of these decisions. You may also think about some of the values that motivate the GDPR requirements that you have drawn upon.
Handing In and Grading
Handin Instructions
As before, you will hand in your code using Git. In the
kvstore/
subdirectory of your projects repository, you MUST fill in the text file calledREADME-5A.md
.By 6:00pm on Friday, April 26th, you must have a commit with all steps completed.
Remind me again what the README should contain?
The README.md file will include the following:GDPRDelete()
implementation.Grading Breakdown
This project is graded out of 140 points, but will be weighted equally with the other projects in the course.
For the KVStore implementation parts:
For the Tweeter part of the project:
GDPRDelete
.Important note on using late hours:
You will be doing Project 5B in the same directory as Project 5A. We will set everyone's grading commit at the normal project deadline so that your Project 5B work won't incur late hours for your Project 5A submission.
However, if you are using late hours, please flag the commit you want graded so that we grade the right commit and assign the right number of late hours to Project 5A.
Now head to the grading server, make sure that you have the “KVStore” page configured correctly with your project repository, and check that your KVStore runs on the grading server as expected.
Congratulations, you’ve completed the first part of the fifth CS 300 project!
Acknowledgements: This project was developed for CS 300 by Richard Tang, Carolyn Zech, Cedric Sirianni, and Malte Schwarzkopf. The Tweeter portion of this project was developed for CS 300 by Eva Schiller, Eva Lau, Colton Rusch, and Malte Schwarzkopf.