Programming odds and ends — InfiniBand, RDMA, and low-latency networking for now.

Building an RDMA-capable application with IB verbs, part 2: the server

In my last post, I covered some basics and described the steps involved in setting up a connection from both the passive/server and active/client sides. In this post I’ll describe the passive side. To recap, the steps involved are:

  1. Create an event channel so that we can receive rdmacm events, such as connection-request and connection-established notifications.
  2. Bind to an address.
  3. Create a listener and return the port/address.
  4. Wait for a connection request.
  5. Create a protection domain, completion queue, and send-receive queue pair.
  6. Accept the connection request.
  7. Wait for the connection to be established.
  8. Post operations as appropriate.

Since almost everything is handled asynchronously, we’ll structure our code as an event-processing loop and a set of event handlers. First, the fundamentals:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>   /* uintptr_t, used when posting work requests */
#include <pthread.h>  /* pthread_t/pthread_create, used by the CQ poller thread */
#include <rdma/rdma_cma.h>

#define TEST_NZ(x) do { if ( (x)) die("error: " #x " failed (returned non-zero)." ); } while (0)
#define TEST_Z(x)  do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0)

static void die(const char *reason);

int main(int argc, char **argv)
{
  return 0;
}

void die(const char *reason)
{
  fprintf(stderr, "%s\n", reason);
  exit(EXIT_FAILURE);
}

Next, we set up an event channel, create an rdmacm ID (roughly analogous to a socket), bind it, and wait in a loop for events (namely, connection requests and connection-established notifications). main() becomes:

static int on_event(struct rdma_cm_event *event);

int main(int argc, char **argv)
{
  struct sockaddr_in addr;
  struct rdma_cm_event *event = NULL;
  struct rdma_cm_id *listener = NULL;
  struct rdma_event_channel *ec = NULL;
  uint16_t port = 0;

  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;

  TEST_Z(ec = rdma_create_event_channel());
  TEST_NZ(rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP));
  TEST_NZ(rdma_bind_addr(listener, (struct sockaddr *)&addr));
  TEST_NZ(rdma_listen(listener, 10)); /* backlog=10 is arbitrary */

  port = ntohs(rdma_get_src_port(listener));

  printf("listening on port %d.\n", port);

  while (rdma_get_cm_event(ec, &event) == 0) {
    struct rdma_cm_event event_copy;

    memcpy(&event_copy, event, sizeof(*event));
    rdma_ack_cm_event(event);

    if (on_event(&event_copy))
      break;
  }

  rdma_destroy_id(listener);
  rdma_destroy_event_channel(ec);

  return 0;
}

ec is a pointer to the rdmacm event channel. listener is a pointer to the rdmacm ID for our listener. We specified RDMA_PS_TCP when creating it, which indicates that we want a connection-oriented, reliable queue pair. RDMA_PS_UDP would indicate a connectionless, unreliable queue pair.

We then bind this ID to a socket address. By setting the port, addr.sin_port, to zero, we instruct rdmacm to pick an available port. We’ve also indicated that we want to listen for connections on any available RDMA interface/device.
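
If you want to listen on a fixed port or a specific interface instead, fill in the address before binding. A minimal sketch, replacing the address setup in main() above (the port and IP here are arbitrary examples; inet_pton() needs <arpa/inet.h>):

  struct sockaddr_in addr;

  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(12345);                      /* fixed port, rather than letting rdmacm choose */
  inet_pton(AF_INET, "192.168.0.1", &addr.sin_addr); /* a specific local interface */

  TEST_NZ(rdma_bind_addr(listener, (struct sockaddr *)&addr));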

Our event loop gets an event from rdmacm, acknowledges the event, then processes it. Failing to acknowledge events will result in rdma_destroy_id() blocking. The event handler for the passive side of the connection is only interested in three events:

static int on_connect_request(struct rdma_cm_id *id);
static int on_connection(void *context);
static int on_disconnect(struct rdma_cm_id *id);

int on_event(struct rdma_cm_event *event)
{
  int r = 0;

  if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST)
    r = on_connect_request(event->id);
  else if (event->event == RDMA_CM_EVENT_ESTABLISHED)
    r = on_connection(event->id->context);
  else if (event->event == RDMA_CM_EVENT_DISCONNECTED)
    r = on_disconnect(event->id);
  else
    die("on_event: unknown event.");

  return r;
}

rdmacm allows us to associate a void * context pointer with an ID. We’ll use this to attach a connection context structure:

struct connection {
  struct ibv_qp *qp;

  struct ibv_mr *recv_mr;
  struct ibv_mr *send_mr;

  char *recv_region;
  char *send_region;
};

This contains a pointer to the queue pair (redundant, but simplifies the code slightly), two buffers (one for sends, the other for receives), and two memory regions (memory used for sends/receives has to be “registered” with the verbs library). When we receive a connection request, we first build our verbs context if it hasn’t already been built. Then, after building our connection context structure, we pre-post our receives (more on this in a bit), and accept the connection request:

static void build_context(struct ibv_context *verbs);
static void build_qp_attr(struct ibv_qp_init_attr *qp_attr);
static void post_receives(struct connection *conn);
static void register_memory(struct connection *conn);

int on_connect_request(struct rdma_cm_id *id)
{
  struct ibv_qp_init_attr qp_attr;
  struct rdma_conn_param cm_params;
  struct connection *conn;

  printf("received connection request.\n");

  build_context(id->verbs);
  build_qp_attr(&qp_attr);

  TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));

  id->context = conn = (struct connection *)malloc(sizeof(struct connection));
  conn->qp = id->qp;

  register_memory(conn);
  post_receives(conn);

  memset(&cm_params, 0, sizeof(cm_params));
  TEST_NZ(rdma_accept(id, &cm_params));

  return 0;
}

We postpone building the verbs context until we receive our first connection request because the rdmacm listener ID isn’t necessarily bound to a specific RDMA device (and associated verbs context). However, the first connection request we receive will have a valid verbs context structure at id->verbs. Building the verbs context involves setting up a static context structure, creating a protection domain, creating a completion channel, creating a completion queue attached to that channel, and starting a thread to pull completions from the queue:

struct context {
  struct ibv_context *ctx;
  struct ibv_pd *pd;
  struct ibv_cq *cq;
  struct ibv_comp_channel *comp_channel;

  pthread_t cq_poller_thread;
};

static void * poll_cq(void *);

static struct context *s_ctx = NULL;

void build_context(struct ibv_context *verbs)
{
  if (s_ctx) {
    if (s_ctx->ctx != verbs)
      die("cannot handle events in more than one context.");

    return;
  }

  s_ctx = (struct context *)malloc(sizeof(struct context));

  s_ctx->ctx = verbs;

  TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx));
  TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx));
  TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 10, NULL, s_ctx->comp_channel, 0));
  TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0));

  TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL));
}

Using a completion channel allows us to block the poller thread waiting for completions. We create the completion queue with cqe set to 10, indicating we want room for ten entries on the queue. This number should be set large enough that the queue isn’t overrun. The poller waits on the channel, acknowledges the completion, rearms the completion queue (with ibv_req_notify_cq()), then pulls events from the queue until none are left:

static void on_completion(struct ibv_wc *wc);

void * poll_cq(void *ctx)
{
  struct ibv_cq *cq;
  struct ibv_wc wc;

  while (1) {
    TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx));
    ibv_ack_cq_events(cq, 1);
    TEST_NZ(ibv_req_notify_cq(cq, 0));

    while (ibv_poll_cq(cq, 1, &wc))
      on_completion(&wc);
  }

  return NULL;
}

Back to our connection request. After building the verbs context, we have to initialize the queue pair attributes structure:

void build_qp_attr(struct ibv_qp_init_attr *qp_attr)
{
  memset(qp_attr, 0, sizeof(*qp_attr));

  qp_attr->send_cq = s_ctx->cq;
  qp_attr->recv_cq = s_ctx->cq;
  qp_attr->qp_type = IBV_QPT_RC;

  qp_attr->cap.max_send_wr = 10;
  qp_attr->cap.max_recv_wr = 10;
  qp_attr->cap.max_send_sge = 1;
  qp_attr->cap.max_recv_sge = 1;
}

We first zero out the structure, then set the attributes we care about. send_cq and recv_cq are the send and receive completion queues, respectively. qp_type is set to indicate we want a reliable, connection-oriented queue pair. The queue pair capabilities structure, qp_attr->cap, is used to negotiate minimum capabilities with the verbs driver. Here we request ten pending sends and receives (at any one time in their respective queues), and one scatter/gather element (SGE; effectively a memory location/size tuple) per send or receive request. After building the queue pair initialization attributes, we call rdma_create_qp() to create the queue pair. We then allocate memory for our connection context structure (struct connection), and allocate/register memory for our send and receive operations:

const int BUFFER_SIZE = 1024;

void register_memory(struct connection *conn)
{
  conn->send_region = malloc(BUFFER_SIZE);
  conn->recv_region = malloc(BUFFER_SIZE);

  TEST_Z(conn->send_mr = ibv_reg_mr(
    s_ctx->pd, 
    conn->send_region, 
    BUFFER_SIZE, 
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE));

  TEST_Z(conn->recv_mr = ibv_reg_mr(
    s_ctx->pd, 
    conn->recv_region, 
    BUFFER_SIZE, 
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE));
}

Here we allocate two buffers, one for sends and the other for receives, then register them with verbs. We specify we want local write and remote write access to these memory regions. The next step in our connection-request event handler (which is getting rather long) is the pre-posting of receives. The reason it is necessary to post receive work requests (WRs) before accepting the connection is that the underlying hardware won’t buffer incoming messages — if a receive request has not been posted to the work queue, the incoming message is rejected and the peer will receive a receiver-not-ready (RNR) error. I’ll discuss this further in another post, but for now it suffices to say that receives have to be posted before sends. We’ll enforce this by posting receives before accepting the connection, and posting sends after the connection is established. Posting receives requires that we build a receive work-request structure and then post it to the receive queue:

void post_receives(struct connection *conn)
{
  struct ibv_recv_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  wr.wr_id = (uintptr_t)conn;
  wr.next = NULL;
  wr.sg_list = &sge;
  wr.num_sge = 1;

  sge.addr = (uintptr_t)conn->recv_region;
  sge.length = BUFFER_SIZE;
  sge.lkey = conn->recv_mr->lkey;

  TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr));
}

The wr_id field is an opaque, application-defined value; we use it to store a pointer to our connection context so we can recover it when the corresponding completion arrives. Finally, having done all this setup, we’re ready to accept the connection request. This is accomplished with a call to rdma_accept().
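
Passing a zeroed rdma_conn_param, as on_connect_request() does, accepts the connection with default parameters. If you need to tune the connection, the structure exposes a few knobs; a sketch (the values below are illustrative, not requirements):

  struct rdma_conn_param cm_params;

  memset(&cm_params, 0, sizeof(cm_params));
  cm_params.responder_resources = 1; /* incoming RDMA reads/atomics we will accept */
  cm_params.initiator_depth = 1;     /* RDMA reads/atomics we may have outstanding */
  cm_params.rnr_retry_count = 7;     /* 7 means retry indefinitely on RNR NAK */

  TEST_NZ(rdma_accept(id, &cm_params));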

The next event we need to handle is RDMA_CM_EVENT_ESTABLISHED, which indicates that a connection has been established. This handler is simple — it merely posts a send work request:

int on_connection(void *context)
{
  struct connection *conn = (struct connection *)context;
  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  snprintf(conn->send_region, BUFFER_SIZE, "message from passive/server side with pid %d", getpid());

  printf("connected. posting send...\n");

  memset(&wr, 0, sizeof(wr));

  wr.opcode = IBV_WR_SEND;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;

  sge.addr = (uintptr_t)conn->send_region;
  sge.length = BUFFER_SIZE;
  sge.lkey = conn->send_mr->lkey;

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

  return 0;
}

This isn’t radically different from the code we used to post a receive, except that send requests specify an opcode. Here, IBV_WR_SEND indicates a send request that must match a corresponding receive request on the peer. Other options include RDMA write, RDMA read, and various atomic operations. Specifying IBV_SEND_SIGNALED in wr.send_flags indicates that we want completion notification for this send request.
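
For comparison, a one-sided RDMA write work request looks much the same, but it must also carry the target’s remote address and rkey. A sketch, assuming the peer has already told us where to write (peer_addr and peer_rkey are placeholders for values exchanged out of band, e.g., via a send/receive message):

  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.opcode = IBV_WR_RDMA_WRITE;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = peer_addr; /* placeholder: buffer address on the peer */
  wr.wr.rdma.rkey = peer_rkey;        /* placeholder: rkey of the peer's registered MR */

  sge.addr = (uintptr_t)conn->send_region;
  sge.length = BUFFER_SIZE;
  sge.lkey = conn->send_mr->lkey;

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

For the write to be accepted, the peer must have registered its region with IBV_ACCESS_REMOTE_WRITE (as register_memory() above does), and the operation completes without consuming a receive request on the peer.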

The last rdmacm event we want to handle is RDMA_CM_EVENT_DISCONNECTED, where we’ll perform some cleanup:

int on_disconnect(struct rdma_cm_id *id)
{
  struct connection *conn = (struct connection *)id->context;

  printf("peer disconnected.\n");

  rdma_destroy_qp(id);

  ibv_dereg_mr(conn->send_mr);
  ibv_dereg_mr(conn->recv_mr);

  free(conn->send_region);
  free(conn->recv_region);

  free(conn);

  rdma_destroy_id(id);

  return 0;
}

All that’s left for us to do is handle completions pulled from the completion queue:

void on_completion(struct ibv_wc *wc)
{
  if (wc->status != IBV_WC_SUCCESS)
    die("on_completion: status is not IBV_WC_SUCCESS.");

  if (wc->opcode & IBV_WC_RECV) { /* IBV_WC_RECV is a flag bit; '&' also matches IBV_WC_RECV_RDMA_WITH_IMM */
    struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;

    printf("received message: %s\n", conn->recv_region);

  } else if (wc->opcode == IBV_WC_SEND) {
    printf("send completed successfully.\n");
  }
}

Recall that in post_receives() we set wr_id to point at the connection context structure. And that’s it! Complete code, for both the passive side/server and the active side/client, is linked in the update below. It’s far from optimal, but I’ll talk more about optimization in later posts. Building is straightforward, but don’t forget -lrdmacm; since we also call into libibverbs and use pthreads, those need linking too. With a typical OFED install, something like this should work (the source file name is assumed):
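
gcc -o server server.c -lrdmacm -libverbs -lpthread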

In my next post I’ll describe the implementation of the active side.

Updated, Oct. 4: Sample code is now at https://github.com/tarickb/the-geek-in-the-corner/tree/master/01_basic-client-server.

12 responses

  1. Brian Alexander

    Thank you for writing this series of blog entries. I am new to IB/OFED and this is the best introduction to writing RDMA applications that I’ve found. I am looking forward to reading this entire category of entries in your blog.

    October 6, 2010 at 2:35 pm

  2. kingss

    Hey geek,
    I found your write-ups very useful. One thing I wanted to know: why is the context needed in ibverbs? I need the basics of it. What’s the bigger picture here with context and udata?

    February 10, 2011 at 11:58 pm

    • In my examples I use the context pointer to associate a struct connection with the rdmacm ID. This simplifies the callback code, since callbacks can use a pointer to the struct connection rather than have to look it up in a table of some sort. In your own applications you can use the context pointer for whatever you want, or not use it at all. It’s up to you.

      February 14, 2011 at 9:01 pm

  3. Richard Sharpe

    Hmmm,

    Why do you do this:

    memcpy(&event_copy, event, sizeof(*event));
    rdma_ack_cm_event(event);

    rather than, say, responding to the event and then ACK’ing the event?

    I can see why you do the memcpy. It is probably because the call to rdma_ack_cm_event will change the event structure.

    Do you ack it before responding because the client will not do anything until it gets an ACK and thus you would have a deadlock if you didn’t?

    February 8, 2012 at 10:50 am

    • Richard Sharpe

      OK, now I understand. The rdma_cm_id for the listening ‘socket’ is different from those for connected ‘sockets,’ which are allocated when you get the connect request, most likely.

      Since you are passing the event into each of the handlers, you could have ACK’ed the event at the point where it was handled, but I guess the model would get very complicated, because the first ACK would ack the event on the listening rdma_cm_id, and subsequent ones would ack those on the respective connected sockets …

      February 8, 2012 at 1:09 pm

      • Actually, that behavior is a vestige of an earlier version of the sample code where I had a separate thread process CM events. That version copied the event, acknowledged it, then posted it to a queue for the event-handler thread to process asynchronously. Even then, it wasn’t strictly necessary — the event handler thread could have acknowledged the event, but I found it cleaner to keep the get/ack calls in the same place. There’s nothing stopping you (that I’m aware of) from getting the event, handling it, then acknowledging it.

        February 11, 2012 at 1:56 pm

  4. Richard Sharpe

    Hmmm, consulting the man page tells me that rdma_ack_cm_event will free the event structure, so obviously you need to copy it if you want to ack it before processing it.

    February 8, 2012 at 10:57 am

  5. Pingback: Basic flow control for RDMA transfers | The Geek in the Corner

  6. Sandor

    Hi Geek,
    If I have a program with multiple clients, should the server use only one QP, or should it create a QP for every connection?

    cheers

    December 17, 2013 at 11:43 am

    • Yes, you’ll need a QP for each connection. If you find yourself using up too much memory for receive buffers, look into shared receive queues (SRQs).

      January 21, 2014 at 10:55 pm

  7. Kevin

    Hi, thank you for sharing the information on IB and RDMA. It is really valuable and useful to me. The content presented here is about two-sided send/receive operations, i.e., a send operation requires a matching receive operation. As far as I know, RDMA operations are one-sided, since these operations can complete without any involvement from the remote process. But I don’t know how to program with these RDMA operations when building applications. Have you ever explored how to use such one-sided RDMA operations, and would you mind sharing some information about them?

    Best Regards,
    Kevin

    August 1, 2014 at 10:52 pm

  8. hexiaobai

    Hi geek!
    I want to ask a question. If there is more than one client connecting to the server, does the server use only one ibv_context, one PD, and one CQ?
    Thanks

    April 14, 2016 at 11:57 am
