Our goal for this project was to add local and network replication mechanisms to the anemone driver to improve its reliability. Network replication allows the driver to overcome a remote server failure by sending requests to a mirror server. Local replication allows the driver to keep functioning through a complete network failure by mirroring page-outs to a local swap space and paging in from that space while the network is down. Implementing network replication was almost trivial, although we did have to modify the hashtable mapping mechanism. Local replication has turned out to be a much greater challenge than originally anticipated and is not yet implemented.
In order to implement network replication we simply need to send pages to more than one server at a time. When a page-in request to one of the servers fails, we mark that server as dead and try to page in from another server holding the page we want. We first tried replicating across two servers by trivially adding a second mapping hash table. This worked initially, but Michael pointed out that the second mapping table introduced considerable memory overhead (2MB). Jian therefore redesigned the mapping table entries to store the page's locations on the network as a bitfield, so we can determine where a page lives with minimal overhead.
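As a rough illustration of that redesign, here is a minimal sketch of a mapping entry that records replica locations in a bitmask. The names (anemone_map_entry, server_mask, ANEMONE_MAX_SERVERS) and field sizes are our own illustration, not the actual anemone structures:

#define ANEMONE_MAX_SERVERS 16           /* assumption: at most 16 servers */

struct anemone_map_entry {
        unsigned long  offset;           /* swap offset of the page        */
        unsigned short server_mask;      /* bit i set: server i holds page */
};

/* Record that a replica of this page lives on the given server. */
static inline void map_add_replica(struct anemone_map_entry *e, int server)
{
        e->server_mask |= 1 << server;
}

/* Pick a live server to page in from, skipping servers marked dead. */
static inline int map_pick_replica(struct anemone_map_entry *e,
                                   unsigned short dead_mask)
{
        unsigned short live = e->server_mask & ~dead_mask;

        return live ? __builtin_ctz(live) : -1;   /* -1: no live copy left */
}

Increasing the replication factor to three just means setting a third bit in server_mask, rather than adding another table.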
Here is the message flow between client and servers:
Note: Page out1 and Page out2 use the same sending packet, just with different destination addresses. We first select the least utilized server (server1) and the second least utilized server (server2).
client                               server1           server2

  Page out1
  0---------------------------------->
  Page out2
  0---------------------------------------------------->
          ack1 (callback)
  <----------------------------------0
          ack2 (delete sending packet)
  <-----------------------------------------------------0

If server1 fails:

  page in
  0---------------------------------------------------->
          callback and delete the sending packet
  <-----------------------------------------------------0
Since both ACKs are currently processed in a single function, some logic modifications are needed there to handle the two replicas.
Michael Hines' suggestions:
Some observations:
1. Should the success notification sent back to the swap daemon, i.e. bio_endio(), not wait until both the first page-out and the replicated copy have been ACK'd? (A sketch of this appears after these observations.)
2. Won't a second mapping table be more overhead than using a single one? Memory usage for data structures matters on the client, especially a memory-constrained one. Mapping the replicated page could happen in two different ways: extending each entry of the existing table by a couple of words, or adding an entire second hashtable, which costs roughly an extra 5 words per offset. Would you rather add 2 words per offset or an extra hashtable with a lot more bytes per offset? And what happens when you want to increase the replication factor to 3? You can't just pop in another hashtable.
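The following is a minimal sketch of the first observation, assuming the 2.6-era three-argument bio_endio(); struct pageout_req, pageout_ack, and free_sending_packet are hypothetical names, not existing anemone code:

#include <linux/bio.h>
#include <linux/slab.h>
#include <asm/atomic.h>

struct pageout_req {
        struct bio *bio;            /* request from the swap daemon        */
        atomic_t    acks_pending;   /* set to 2 with atomic_set() at send  */
};

/* Called from the receive path for each ACK belonging to this request. */
static void pageout_ack(struct pageout_req *req)
{
        /* Only after the last replica has ACK'd do we report success to
         * the swap daemon and release the shared sending packet. */
        if (atomic_dec_and_test(&req->acks_pending)) {
                bio_endio(req->bio, req->bio->bi_size, 0);
                /* free_sending_packet(req);  -- hypothetical helper */
                kfree(req);
        }
}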
We're implementing local replication in order to provide a recovery mode in the case of a total network failure. The idea is to mirror all page-outs sent to the network in a local swap file. If the network ever fails, page-ins are read from the swap file and page-outs continue to be mirrored to the local file. Because every page-out sent over the network would also be written to the local file, we must keep the write latency low to avoid losing the advantages of swapping over the network. The only way to do this is to write to the local disk asynchronously: store the page being swapped out in a local buffer, add it to a write-out queue, and have a thread process the queue as it fills. We could implement a similar mechanism for asynchronous reading from the disk, but because reads only happen in recovery mode, their latency is not a priority.
Implementing these features would require few changes to the existing anemone code: a relatively small amount of new code implementing the asynchronous processing, plus code for interfacing with the rest of the driver. We would only have to modify the cache_add and cache_retrieve functions to call our local_write and local_read functions: cache_add would call local_write whenever something is added to the cache, and cache_retrieve would call local_read whenever a network retrieval fails or the network is known to be down. The local_read/local_write functions would fill a request queue, schedule the request processing thread, and then return to the swap daemon. The request processing thread would then handle reading and writing directly to/from the swap file and wake up any other threads that may have been waiting on the request. We want to reuse the existing write cache, so we would also have to modify some of the cache management functions to keep pages being written locally in the cache until the write has completed.
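Here is a minimal sketch of what this queue-and-thread design might look like. struct local_write_req and the function bodies are our own illustration of the idea; the real cache_add/cache_retrieve integration would differ:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/slab.h>
#include <linux/mm.h>

struct local_write_req {
        struct list_head list;
        struct page     *page;           /* page being mirrored locally    */
        unsigned long    offset;         /* offset in the local swap space */
};

static LIST_HEAD(write_queue);
static DEFINE_SPINLOCK(write_queue_lock);
static DECLARE_WAIT_QUEUE_HEAD(write_queue_wait);

/* Called from cache_add(): queue the page and return immediately so the
 * swap daemon is not held up by disk latency. */
static int local_write(struct page *page, unsigned long offset)
{
        struct local_write_req *req = kmalloc(sizeof(*req), GFP_ATOMIC);

        if (!req)
                return -ENOMEM;
        req->page   = page;
        req->offset = offset;

        spin_lock(&write_queue_lock);
        list_add_tail(&req->list, &write_queue);
        spin_unlock(&write_queue_lock);

        wake_up(&write_queue_wait);
        return 0;
}

/* Background thread (started with kthread_run() at module init): drain
 * the queue, writing each page to the local swap space, then wake anyone
 * waiting on the request. */
static int local_write_thread(void *unused)
{
        while (!kthread_should_stop()) {
                struct local_write_req *req = NULL;

                spin_lock(&write_queue_lock);
                if (!list_empty(&write_queue)) {
                        req = list_entry(write_queue.next,
                                         struct local_write_req, list);
                        list_del(&req->list);
                }
                spin_unlock(&write_queue_lock);

                if (req) {
                        /* write req->page at req->offset to disk here,
                         * then release the cache entry and kfree(req) */
                        continue;
                }
                wait_event_interruptible(write_queue_wait,
                                         !list_empty(&write_queue) ||
                                         kthread_should_stop());
        }
        return 0;
}

local_read in recovery mode would be the synchronous counterpart: queue the request, wake the thread, and sleep until the page has been read.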
Our original implementation plan for local replication was to simply have our driver open a swap file on initialization and then read and write pages to it at the same offsets given by the swap daemon. This turned out to be a lot easier said than done, since the Linux kernel provides no API for accessing the logical filesystem from kernel space and the community frowns on any hacks that do so. After some searching, it seems the only way to access a local swap space is directly through its block device interface. We would then have to send requests directly to that block device whenever we needed to page in or out. This is similar to what is done in software RAID drivers, which wrap real block devices with a logical block device implementing the RAID policy. Unfortunately the Linux software RAID driver (md) is highly complicated, but we have gleaned some useful information from it.
There doesn't seem to be any way to reuse the drivers and infrastructure already in place to achieve what we want. The device mapper and md drivers are heavily tailored towards implementing logical volume management and software RAID devices. Studying that code has, however, revealed the mechanisms they use to access the underlying block device. We want to be able to set the block device we should swap to as a command line option or an ioctl call. To do this we can call the path_lookup function declared in linux/namei.h and extract the device number in a similar fashion to this function found in drivers/md/dm-table.c:
/*
 * Convert a device path to a dev_t.
 */
static int lookup_device(const char *path, dev_t *dev)
{
        int r;
        struct nameidata nd;
        struct inode *inode;

        if ((r = path_lookup(path, LOOKUP_FOLLOW, &nd)))
                return r;

        inode = nd.dentry->d_inode;
        if (!inode) {
                r = -ENOENT;
                goto out;
        }

        if (!S_ISBLK(inode->i_mode)) {
                r = -ENOTBLK;
                goto out;
        }

        *dev = inode->i_rdev;

 out:
        path_release(&nd);
        return r;
}
Once we have the device number, we prepare to issue requests to it by creating bio_vec and bio structs which point to the pages and the callback function we specify and, most importantly, to the block device we're swapping to. We then call submit_bio (declared in linux/fs.h) to tell the block device layer to start processing our request. Doing it this way should work out well: since the block device layer already runs concurrently, we shouldn't even need any additional threads. For a page-out from the swap daemon, early in cache_add we'd call our local_write function, which would create the bio and other necessary structures for the write request, call submit_bio, and return so the remainder of cache_add can finish. In the case of a network failure where we need to page in, a call would be made to local_read, which would again create a bio struct and the other needed structures for the read and call submit_bio. Since we probably want to read synchronously, we would then make the swap daemon's thread wait in our local_read function until our I/O completion callback wakes it up when the data has been read. This is all similar to what is done in the device mapper I/O system in drivers/md/dm-io.c, which is where this information came from.
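To make this concrete, here is a minimal sketch of local_write in this design, assuming the 2.6-era block layer interface (bio_alloc, bio_add_page, submit_bio, and the three-argument bi_end_io callback); anemone_bdev and the sector argument are assumptions about how the rest of the driver would supply the device and offset:

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

static struct block_device *anemone_bdev;   /* opened from the dev_t above */

/* Completion callback: the block layer calls this when the write finishes. */
static int local_write_end_io(struct bio *bio, unsigned int bytes_done,
                              int error)
{
        if (bio->bi_size)                /* not finished yet */
                return 1;

        /* here: drop the page from the write cache, wake any waiters */
        bio_put(bio);
        return 0;
}

static int local_write(struct page *page, sector_t sector)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 1);

        if (!bio)
                return -ENOMEM;

        bio->bi_bdev   = anemone_bdev;
        bio->bi_sector = sector;         /* offset within the swap device */
        bio->bi_end_io = local_write_end_io;

        if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
                bio_put(bio);
                return -EIO;
        }

        submit_bio(WRITE, bio);          /* block layer completes it async */
        return 0;
}

local_read would be built the same way with READ instead of WRITE, except the caller would sleep (for instance on a completion) until the end_io callback signals that bi_size has reached zero.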