Programming odds and ends — InfiniBand, RDMA, and low-latency networking for now.

Latest

gcsfs, a FUSE driver for Google Cloud Storage, now available

gcsfs is a FUSE driver for Google Cloud Storage with much the same functionality as s3fuse. It isn’t quite a fork (for the moment the two drivers are very similar), but splitting it out makes it easier to develop Google Cloud Storage-specific features separately. Some key features:

  • Binaries for Debian 7, RHEL/CentOS 6, Ubuntu, and OS X 10.9.
  • Compatible with GCS web console.
  • Caches file/directory metadata for improved performance.
  • Verifies file consistency on upload/download.
  • Optional file content encryption.
  • Supports multi-part (resumable) uploads.
  • Allows setting Cache-Control header (via extended attributes on files).
  • Maps file extensions to known MIME types (to set Content-Type header).
  • (Mostly) POSIX compliant.

Binaries (and source packages) are in gcsfs-downloads (on Google Drive).

Users of CentOS images on Google Compute Engine can install pre-built binaries by downloading gcsfs-0.15-1.x86_64.rpm, then using yum:

[user@gce-instance-centos ~]$ sudo yum localinstall gcsfs-0.15-1.x86_64.rpm
...
[user@gce-instance-centos ~]$ gcsfs -V
gcsfs, 0.15 (r792M), FUSE driver for cloud object storage services
enabled services: aws, fvs, google-storage
[user@gce-instance-centos ~]$

For Debian, download gcsfs_0.15-1_amd64.deb, then:

user@gce-instance-debian:~$ sudo dpkg -i gcsfs_0.15-1_amd64.deb
... a bunch of dependency errors ...
user@gce-instance-debian:~$ sudo apt-get install -f
...
user@gce-instance-debian:~$ gcsfs -V
gcsfs, 0.15 (r792M), FUSE driver for cloud object storage services
enabled services: aws, fvs, google-storage
user@gce-instance-debian:~$

s3fuse 0.15 released

s3fuse 0.15 is now available. This release contains fairly minor fixes and packaging updates. Highlights from the change log:

Removed libxml++ dependency.
libxml++ was pulling in many unnecessary package dependencies and wasn’t really providing much added value over libxml2, so as of 0.15 it’s gone. As a bonus, it’s no longer necessary to enable the EPEL repository on RHEL/CentOS before installing s3fuse.

Fixed libcurl init/cleanup bug.
0.14 and earlier versions had a bug that sometimes prevented establishment of SSL connections if s3fuse ran in daemonized (background) mode. 0.15 addresses this.

Binaries for RHEL/CentOS, Debian, and OS X, as well as source archives, are now hosted in s3fuse-downloads (on Google Drive).

Ubuntu packages are at the s3fuse PPA.

Sample Code Repository

Updated code for the RDMA tutorials is now hosted at GitHub:

https://github.com/tarickb/the-geek-in-the-corner

Special thanks to SeongJae P. for the Makefile fixes.

Importing VHD Disk Images into XCP

Earlier today I needed to move a VHD disk image from VirtualBox into XCP, but couldn’t find an obvious way to do this. The xe vdi-import command only seems to work with raw disk images, and I didn’t want to convert my VHD into a raw image only to have XCP convert it back to a VHD. What I ended up doing was creating a new VDI in an NFS storage repository, then overwriting the image.

The main caveat to this approach is that it only works with VHD images, and only with NFS repositories (at least that’s what I gather from the documentation).

Also, don’t blame me if something goes wrong and you accidentally wipe out your mission-critical disk image, or your XCP host, etc.

So here’s what I did:

  1. Get the SR ID: SR_ID=$(xe sr-list type=nfs --minimal)
  2. Get the associated PBD: PBD_ID=$(xe sr-param-list uuid=$SR_ID | grep 'PBDs' | sed -e 's/.*: //')
  3. Get the NFS path: xe pbd-param-list uuid=$PBD_ID | grep 'device-config'
  4. Create an image, specifying a virtual size at least as large as the VHD: VDI_ID=$(xe vdi-create sr-uuid=$SR_ID name-label='new image' type=user virtual-size=<image-size>)
  5. Mount the NFS export locally: mount <nfs-server>:<nfs-path> /mnt
  6. Replace the VHD: cp <your-vhd-file> /mnt/$SR_ID/$VDI_ID.vhd

s3fuse 0.14 released

Over the weekend I posted version 0.14 of s3fuse. Highlights from the change log:

NEW: Multipart (resumable) uploads for Google Cloud Storage.
With this most recent release, Google Cloud Storage support is on par with S3 support. Multipart/resumable uploads and downloads work reliably, and performance is similar. Many thanks to Eric J. at Google for all the help improving GCS support in 0.14.

NEW: Support for FV/S.
With the help of Hiroyuki K. at IIJ, s3fuse now supports IIJ’s FV/S cloud storage service.

NEW: Set file content type by examining extension.
s3fuse will now set the HTTP Content-Type header according to the file extension using the system MIME map.

NEW: Set Cache-Control header with extended attribute.
If a Cache-Control header is returned for an object, it will be available in the s3fuse_cache_control extended attribute. Setting the extended attribute (with, say, setfattr) will update the Cache-Control header for the object.
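
For instance, setting the header from a program might look like this. This is a minimal sketch using the Linux setxattr(2) call; the mount point, file name, and header value below are made up.

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
  /* hypothetical s3fuse mount point and Cache-Control value */
  const char *path = "/mnt/bucket/index.html";
  const char *value = "max-age=3600, public";

  if (setxattr(path, "s3fuse_cache_control", value, strlen(value), 0) == -1) {
    perror("setxattr");
    return 1;
  }

  return 0;
}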

NEW: Allow creation of “special” files (FIFOs, devices, etc.).
mkfifo and mknod now work in s3fuse-mounted buckets, with semantics similar to NFS-mounted filesystems (in particular: FIFOs do not form IPC channels between hosts).

Various POSIX compliance fixes.
From the README:

s3fuse is mostly POSIX compliant, except that:

  • Last-access time (atime) is not recorded.
  • Hard links are not supported.

Some notes on testing:

All tests should pass, except:

  chown/00.t: 141, 145, 149, 153
    [FUSE doesn't call chown when uid == -1 and gid == -1]

  mkdir/00.t: 30
  mkfifo/00.t: 30
  open/00.t: 30
    [atime is not recorded]

  link/*.t
  rename/00.t: 7-9, 11, 13-14, 27-29, 31, 33-34
  unlink/00.t: 15-17, 20-22, 51-53
    [hard links are not supported]

As with 0.13, Ubuntu packages are at the s3fuse PPA.

s3fuse 0.13 released

I’ve just uploaded version 0.13 of s3fuse, my FUSE driver for Amazon S3 (and Google Cloud Storage) buckets. 0.13 is a near-complete rewrite of s3fuse, and brings a few new features and vastly improved (I hope) robustness. From the change log:

NEW: File encryption
Operates at the file level and encrypts the contents of files with a key (or set of keys) that you control. See the README.

NEW: Glacier restore requests
Allows for the restoration of files auto-archived to Glacier. See this AWS blog post and the README for more information.

NEW: OS X binaries
A disk image (.dmg) is now available on the downloads page containing pre-built OS X binaries (built on OS X 10.8.2, so compatibility may be limited).

NEW: Size-limited object cache
The object attribute cache now has a fixed size. This addresses the memory utilization issues reported by Gregory C. and others. The default maximum size is 1,000 objects, but this can be changed by tweaking the max_objects_in_cache configuration option.

IMPORTANT: Removed auth_data configuration option
For AWS, use aws_secret_file instead. For Google Storage, use gs_token_file. This will require a change to existing configuration files.

IMPORTANT: Default configuration file locations
s3fuse now searches for a configuration file in ~/.s3fuse/s3fuse.conf before trying %sysconfdir/s3fuse.conf (this is usually /etc/s3fuse.conf or /usr/local/etc/s3fuse.conf).

File Hashing
SHA256 is now used for file integrity checks. The file hash, if available, will be in the “s3fuse_sha256” extended attribute. A standalone SHA256 hash generator (“s3fuse_sha256_hash”) that uses the same hash-list method as s3fuse is now included.

Statistics
Set the stats_file configuration option to a valid path and s3fuse will write statistics (event counters, mainly) to that path when unmounting.

OS X default options
noappledouble and daemon_timeout=3600 are now default FUSE options on OS X.

KNOWN ISSUE: Google Cloud Storage large file uploads
Multipart GCS uploads are not implemented. Large files will time out unless the transfer_timeout_in_s configuration option is set to something very large.

RDMA tutorial PDFs

In cooperation with the HPC Advisory Council, I’ve reformatted three of my RDMA tutorials for easier offline reading. You can find them, along with several papers on InfiniBand, GPUs, and other interesting topics, at the HPC Training page.

Basic flow control for RDMA transfers

Commenter Matt recently asked about sending large(r) amounts of data:

… I’m wondering if you would be able to provide some pointers or even examples that send very large amounts of data. e.g. sending files up to or > 2GB. Your examples use 1024 byte buffers. I suspect there is an efficient way of doing this given that there is a 2**31 limit for the message size.

I should point out that I don’t have lots of memory available as it’s used for other things.

There are many ways to do this, but since I’ve already covered using send/receive operations and using RDMA read/write, this would be a good occasion to combine elements of both and talk about how to handle flow control in general. I’ll also talk a bit about the RDMA-write-with-immediate-data (IBV_WR_RDMA_WRITE_WITH_IMM) operation, and I’ll illustrate these methods with a sample that transfers, using RDMA, a file specified on the command line.

As in previous posts, our sample consists of a server and a client. The server waits for connections from the client. The client does essentially two things after connecting to the server: it sends the name of the file it’s transferring, and then it sends the contents of the file. We won’t concern ourselves with the nuts and bolts of establishing a connection; that’s been covered in previous posts. Instead, we’ll focus on synchronization and flow control. One structural note: the code for this post inverts the layout I used for my RDMA read/write post. There, the connection management code was split between client.c and server.c, with the completion-processing code in common.c; here, the connection management is centralized in common.c and the completion processing is divided between client.c and server.c.

Back to our example. There are many ways we could orchestrate the transfer of an entire file from client to server. For instance:

  • Load the entire file into client memory, connect to the server, wait for the server to post a set of receives, then issue a send operation (on the client side) to copy the contents to the server.
  • Load the entire file into client memory, register the memory, pass the region details to the server, let it issue an RDMA read to copy the entire file into its memory, then write the contents to disk.
  • As above, but issue an RDMA write to copy the file contents into server memory, then signal it to write to disk.
  • Open the file on the client, read one chunk, wait for the server to post a receive, then post a send operation on the client side, and loop until the entire file is sent.
  • As above, but use RDMA reads.
  • As above, but use RDMA writes.

Loading the entire file into memory can be impractical for large files, so we’ll skip the first three options. Of the remaining three, I’ll focus on using RDMA writes so that I can illustrate the use of the RDMA-write-with-immediate-data operation, something I’ve been meaning to talk about for a while. This operation is similar to a regular RDMA write except that the initiator can “attach” a 32-bit value to the write operation. Unlike regular RDMA writes, RDMA writes with immediate data require that a receive operation be posted on the target’s receive queue; the 32-bit value is then delivered in the imm_data field of the work completion pulled from the target’s completion queue.

Update, Dec. 26: Roland D. rather helpfully pointed out that RDMA write with immediate data isn’t supported by iWARP adapters. We could rewrite to use an RDMA write (without immediate data) followed by a send, but this is left as an exercise for the reader.

Now that we’ve decided we’re going to break up the file into chunks, and write the chunks one at a time into the server’s memory, we need to find a way to ensure we don’t write chunks faster than the server can process them. We’ll do this by having the server send explicit messages to the client when it’s ready to receive data. The client, on the other hand, will use writes with immediate data to signal the server. The sequence looks something like this:

  1. Server starts listening for connections.
  2. Client posts a receive operation for a flow-control message and initiates a connection to the server.
  3. Server posts a receive operation for an RDMA write with immediate data and accepts the connection from the client.
  4. Server sends the client its target memory region details.
  5. Client re-posts a receive operation then responds by writing the name of the file to the server’s memory region. The immediate data field contains the length of the file name.
  6. Server opens a file descriptor, re-posts a receive operation, then responds with a message indicating it is ready to receive data.
  7. Client re-posts a receive operation, reads a chunk from the input file, then writes the chunk to the server’s memory region. The immediate data field contains the size of the chunk in bytes.
  8. Server writes the chunk to disk, re-posts a receive operation, then responds with a message indicating it is ready to receive data.
  9. Repeat steps 7 and 8 until there is no data left to send.
  10. Client re-posts a receive operation, then initiates a zero-byte write to the server’s memory. The immediate data field is set to zero.
  11. Server responds with a message indicating it is done.
  12. Client closes the connection.
  13. Server closes the file descriptor.

A diagram may be helpful:
[diagram: File Transfer message sequence between client and server]

Looking at this sequence we see that the server only ever sends small messages to the client and only ever receives RDMA writes from the client. The client only ever executes RDMA writes and only ever receives small messages from the server.
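
The flow-control messages themselves are tiny. Their definition doesn’t appear in the excerpts below, but common.h might declare something like the following sketch (the names follow the code that uses them; treat the details as an assumption rather than the exact header):

enum message_id
{
  MSG_INVALID = 0, /* so a zeroed message isn't mistaken for a real one */
  MSG_MR,          /* server -> client: here are my buffer details */
  MSG_READY,       /* server -> client: ready for the next chunk */
  MSG_DONE         /* server -> client: acknowledging the zero-byte write */
};

struct message
{
  int id;

  union
  {
    struct
    {
      uint64_t addr; /* server-side buffer address, target of RDMA writes */
      uint32_t rkey; /* remote key for that buffer */
    } mr;
  } data;
};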

Let’s start by looking at the server. The connection-establishment details are now hidden behind rc_init(), which sets various callback functions, and rc_server_loop(), which runs an event loop:

int main(int argc, char **argv)
{
  rc_init(
    on_pre_conn,
    on_connection,
    on_completion,
    on_disconnect);

  printf("waiting for connections. interrupt (^C) to exit.\n");

  rc_server_loop(DEFAULT_PORT);

  return 0;
}

The callback names are fairly obvious: on_pre_conn() is called when a connection request is received but before it is accepted, on_connection() is called when a connection is established, on_completion() is called when an entry is pulled from the completion queue, and on_disconnect() is called upon disconnection.
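
Each connection’s state hangs off id->context in a struct conn_context. I haven’t reproduced it in these excerpts; roughly, it contains something like this (a sketch based on how the fields are used below, not the exact source):

struct conn_context
{
  char *buffer;                  /* target of the client's RDMA writes */
  struct ibv_mr *buffer_mr;

  struct message *msg;           /* outgoing flow-control message */
  struct ibv_mr *msg_mr;

  int fd;                        /* output file, opened once the name arrives */
  char file_name[MAX_FILE_NAME]; /* empty until the client sends it */
};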

In on_pre_conn(), we allocate a structure to contain various connection context fields (a buffer to contain data from the client, a buffer from which to send messages to the client, etc.) and post a receive work request for the client’s RDMA writes:

static void post_receive(struct rdma_cm_id *id)
{
  struct ibv_recv_wr wr, *bad_wr = NULL;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.sg_list = NULL;
  wr.num_sge = 0;

  TEST_NZ(ibv_post_recv(id->qp, &wr, &bad_wr));
}

What’s interesting here is that we’re setting sg_list = NULL and num_sge = 0. Incoming RDMA write requests will specify a target memory address, and since this work request is only ever going to match incoming RDMA writes, we don’t need to use sg_list and num_sge to specify a location in memory for the receive. After the connection is established, on_connection() sends the memory region details to the client:

static void on_connection(struct rdma_cm_id *id)
{
  struct conn_context *ctx = (struct conn_context *)id->context;

  ctx->msg->id = MSG_MR;
  ctx->msg->data.mr.addr = (uintptr_t)ctx->buffer_mr->addr;
  ctx->msg->data.mr.rkey = ctx->buffer_mr->rkey;

  send_message(id);
}

This prompts the client to begin issuing RDMA writes, which trigger the on_completion() callback:

static void on_completion(struct ibv_wc *wc)
{
  struct rdma_cm_id *id = (struct rdma_cm_id *)(uintptr_t)wc->wr_id;
  struct conn_context *ctx = (struct conn_context *)id->context;

  if (wc->opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
    uint32_t size = ntohl(wc->imm_data);

    if (size == 0) {
      ctx->msg->id = MSG_DONE;
      send_message(id);

      // don't need post_receive() since we're done with this connection

    } else if (ctx->file_name[0]) {
      ssize_t ret;

      printf("received %i bytes.\n", size);

      ret = write(ctx->fd, ctx->buffer, size);

      if (ret != size)
        rc_die("write() failed");

      post_receive(id);

      ctx->msg->id = MSG_READY;
      send_message(id);

    } else {
      memcpy(ctx->file_name, ctx->buffer, (size > MAX_FILE_NAME) ? MAX_FILE_NAME : size);
      ctx->file_name[size - 1] = '\0';

      printf("opening file %s\n", ctx->file_name);

      ctx->fd = open(ctx->file_name, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

      if (ctx->fd == -1)
        rc_die("open() failed");

      post_receive(id);

      ctx->msg->id = MSG_READY;
      send_message(id);
    }
  }
}

We retrieve the immediate data field from wc->imm_data and convert it from network byte order to host byte order. We then test three possible conditions:

  1. If size == 0, the client has finished writing data. We acknowledge this with MSG_DONE.
  2. If the first byte of ctx->file_name is set, we already have the file name and an open file descriptor. We call write() to append the client’s data to the open file, then reply with MSG_READY, indicating we’re ready to accept more data.
  3. Otherwise, we have yet to receive the file name. We copy it from the incoming buffer, open a file descriptor, then reply with MSG_READY to indicate we’re ready to receive data.

Upon disconnection, in on_disconnect(), we close the open file descriptor and tidy up memory registrations, etc. And that’s it for the server!
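
For completeness, on_disconnect() might look something like this; the sketch assumes the buffers were allocated with plain malloc() and that fd starts out as -1 in on_pre_conn():

static void on_disconnect(struct rdma_cm_id *id)
{
  struct conn_context *ctx = (struct conn_context *)id->context;

  if (ctx->fd != -1)
    close(ctx->fd); /* release the output file */

  /* deregister the memory regions, then free the buffers behind them */
  ibv_dereg_mr(ctx->buffer_mr);
  ibv_dereg_mr(ctx->msg_mr);

  free(ctx->buffer);
  free(ctx->msg);
  free(ctx);
}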

On the client side, main() is a little more complex in that we need to pass the server host name and port into rc_client_loop():

int main(int argc, char **argv)
{
  struct client_context ctx;

  if (argc != 3) {
    fprintf(stderr, "usage: %s <server-address> <file-name>\n", argv[0]);
    return 1;
  }

  ctx.file_name = basename(argv[2]);
  ctx.fd = open(argv[2], O_RDONLY);

  if (ctx.fd == -1) {
    fprintf(stderr, "unable to open input file \"%s\"\n", ctx.file_name);
    return 1;
  }

  rc_init(
    on_pre_conn,
    NULL, // on connect
    on_completion,
    NULL); // on disconnect

  rc_client_loop(argv[1], DEFAULT_PORT, &ctx);

  close(ctx.fd);

  return 0;
}
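
The struct client_context that main() hands to rc_client_loop() isn’t shown above; based on how its fields are used below, it looks roughly like this (again, a sketch rather than the exact header):

struct client_context
{
  char *buffer;          /* staging buffer for outgoing chunks */
  struct ibv_mr *buffer_mr;

  struct message *msg;   /* incoming flow-control messages land here */
  struct ibv_mr *msg_mr;

  uint64_t peer_addr;    /* server buffer address and rkey, from MSG_MR */
  uint32_t peer_rkey;

  int fd;                /* input file descriptor */
  const char *file_name; /* base name sent to the server */
};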

We don’t provide on-connection or on-disconnect callbacks because these events aren’t especially relevant to the client. The on_pre_conn() callback is fairly similar to the server’s, except that the connection context structure is pre-allocated, and the receive work request we post (in post_receive()) requires a memory region:

static void post_receive(struct rdma_cm_id *id)
{
  struct client_context *ctx = (struct client_context *)id->context;

  struct ibv_recv_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.sg_list = &sge;
  wr.num_sge = 1;

  sge.addr = (uintptr_t)ctx->msg;
  sge.length = sizeof(*ctx->msg);
  sge.lkey = ctx->msg_mr->lkey;

  TEST_NZ(ibv_post_recv(id->qp, &wr, &bad_wr));
}

We point sg_list to a buffer large enough to hold a struct message. The server will use this to pass along flow control messages. Each message will trigger a call to on_completion(), which is where the client does the bulk of its work:

static void on_completion(struct ibv_wc *wc)
{
  struct rdma_cm_id *id = (struct rdma_cm_id *)(uintptr_t)(wc->wr_id);
  struct client_context *ctx = (struct client_context *)id->context;

  if (wc->opcode & IBV_WC_RECV) {
    if (ctx->msg->id == MSG_MR) {
      ctx->peer_addr = ctx->msg->data.mr.addr;
      ctx->peer_rkey = ctx->msg->data.mr.rkey;

      printf("received MR, sending file name\n");
      send_file_name(id);
    } else if (ctx->msg->id == MSG_READY) {
      printf("received READY, sending chunk\n");
      send_next_chunk(id);
    } else if (ctx->msg->id == MSG_DONE) {
      printf("received DONE, disconnecting\n");
      rc_disconnect(id);
      return;
    }

    post_receive(id);
  }
}

This matches the sequence described above. Both send_file_name() and send_next_chunk() ultimately call write_remote():

static void write_remote(struct rdma_cm_id *id, uint32_t len)
{
  struct client_context *ctx = (struct client_context *)id->context;

  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.imm_data = htonl(len);
  wr.wr.rdma.remote_addr = ctx->peer_addr;
  wr.wr.rdma.rkey = ctx->peer_rkey;

  if (len) {
    wr.sg_list = &sge;
    wr.num_sge = 1;

    sge.addr = (uintptr_t)ctx->buffer;
    sge.length = len;
    sge.lkey = ctx->buffer_mr->lkey;
  }

  TEST_NZ(ibv_post_send(id->qp, &wr, &bad_wr));
}

This RDMA request differs from those used in earlier posts in two ways: we set opcode to IBV_WR_RDMA_WRITE_WITH_IMM, and we set imm_data to the length of our buffer.
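
Neither send_file_name() nor send_next_chunk() is shown above, but given write_remote() they’re short. Here are sketches of both, assuming a BUFFER_SIZE constant shared with the server (an approximation of the real code, not a verbatim copy):

static void send_file_name(struct rdma_cm_id *id)
{
  struct client_context *ctx = (struct client_context *)id->context;

  strcpy(ctx->buffer, ctx->file_name);

  /* include the terminating NUL so the server can bound its copy */
  write_remote(id, strlen(ctx->file_name) + 1);
}

static void send_next_chunk(struct rdma_cm_id *id)
{
  struct client_context *ctx = (struct client_context *)id->context;

  ssize_t size = read(ctx->fd, ctx->buffer, BUFFER_SIZE);

  if (size == -1)
    rc_die("read() failed");

  /* at end-of-file size is zero: a zero-byte write tells the server we're done */
  write_remote(id, size);
}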

That wasn’t too bad, was it? If everything’s working as expected, you should see the following:

ib-host-1$ ./server 
waiting for connections. interrupt (^C) to exit.
opening file test-file
received 10485760 bytes.
received 10485760 bytes.
received 5242880 bytes.
finished transferring test-file
^C

ib-host-1$ md5sum test-file
5815ed31a65c5da9745764c887f5f777  test-file
ib-host-2$ dd if=/dev/urandom of=test-file bs=1048576 count=25
25+0 records in
25+0 records out
26214400 bytes (26 MB) copied, 3.11979 seconds, 8.4 MB/s

ib-host-2$ md5sum test-file
5815ed31a65c5da9745764c887f5f777  test-file

ib-host-2$ ./client ib-host-1 test-file
received MR, sending file name
received READY, sending chunk
received READY, sending chunk
received READY, sending chunk
received READY, sending chunk
received DONE, disconnecting

If instead you see an error during memory registration, such as the following, you may need to increase your locked memory resource limits:

error: ctx->buffer_mr = ibv_reg_mr(rc_get_pd(), ctx->buffer, BUFFER_SIZE, IBV_ACCESS_LOCAL_WRITE) failed (returned zero/null).

The OpenMPI FAQ has a good explanation of how to do this.
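
On Linux the limit in question is RLIMIT_MEMLOCK (the value ulimit -l reports), since ibv_reg_mr() pins the pages it registers. If you’d rather check the current limit from code, here’s a quick stand-alone diagnostic, separate from the sample:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1) {
    perror("getrlimit");
    return 1;
  }

  /* registrations count against this limit; RLIM_INFINITY shows up
     as a very large number */
  printf("locked-memory limit: soft=%llu, hard=%llu bytes\n",
         (unsigned long long)rl.rlim_cur,
         (unsigned long long)rl.rlim_max);

  return 0;
}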

Once more, the sample code is available here.

Updated, Dec. 21: Updated post to describe locked memory limit errors, and updated sample code to: check for ibv_reg_mr() errors; use basename() of file path rather than full path; add missing mode parameter to open() call; add missing library reference to Makefile. Thanks Matt.

Updated, Oct. 4: Sample code is now at https://github.com/tarickb/the-geek-in-the-corner/tree/master/03_file-transfer.

OS X support and other s3fuse news

Version 0.12 of my pet project, s3fuse, now supports OS X (via FUSE4x). A few notes/caveats:

  • Only FUSE4x is supported. OSXFUSE is not.
  • -o noappledouble is your friend. It will keep OS X from filling your S3 bucket with .DS_Store files as you browse the mounted volume.
  • Set a reasonable daemon timeout (e.g., -o daemon_timeout=3600) to keep FUSE4x from timing out and aborting large uploads/downloads.

Sometime in January I’ll release version 0.13, which is a near-complete rewrite that adds support for file-level encryption. I’m also working on adding support for file retrieval from Glacier (for files archived by S3 — see this post on the AWS blog).

s3fuse, now with Google Storage support

Just posted version 0.11 of s3fuse, my FUSE driver for Amazon S3 and, now, Google Storage for Developers. 0.11 also improves stability, error handling, logging, and directory caching. Give it a try. In addition to a source tarball, packages are available for both Debian and Red Hat.