Programming odds and ends — InfiniBand, RDMA, and low-latency networking for now.

Building an RDMA-capable application with IB verbs, part 3: the client

In my last post, I covered the steps involved in building the passive/server side of our basic verbs application. In this post I’ll discuss the active/client side. Since the code is very similar, I’ll focus on the differences. To recap, the steps involved in connecting to the passive/server side are:

  1. Create an event channel so that we can receive rdmacm events, such as address-resolved, route-resolved, and connection-established notifications.
  2. Create a connection identifier.
  3. Resolve the peer’s address, which binds the connection identifier to a local RDMA device.
  4. Create a protection domain, completion queue, and send-receive queue pair.
  5. Resolve the route to the peer.
  6. Connect.
  7. Wait for the connection to be established.
  8. Post operations as appropriate.

On the command line, our client takes a server host name or IP address and a port number. We use getaddrinfo() to translate these two parameters to struct sockaddr. This requires that we include a new header file:

#include <netdb.h>
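In isolation, the translation getaddrinfo() performs looks like this (a standalone sketch, not part of the client; the loopback address and port number are placeholders):

```c
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
  struct addrinfo *addr;
  struct sockaddr_in *sin;
  char buf[INET_ADDRSTRLEN];

  /* translate a numeric host and port into a struct sockaddr;
     NULL hints accepts any family/socktype, as in the client */
  if (getaddrinfo("127.0.0.1", "45267", NULL, &addr) != 0) {
    fprintf(stderr, "getaddrinfo failed\n");
    return 1;
  }

  sin = (struct sockaddr_in *)addr->ai_addr;
  printf("%s:%d\n",
         inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)),
         ntohs(sin->sin_port));

  /* safe to free after rdma_resolve_addr(), which copies the address */
  freeaddrinfo(addr);
  return 0;
}
```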

We also modify main() to determine the server’s address (using getaddrinfo()):

const int TIMEOUT_IN_MS = 500; /* ms */

int main(int argc, char **argv)
{
  struct addrinfo *addr;
  struct rdma_cm_event *event = NULL;
  struct rdma_cm_id *conn = NULL;
  struct rdma_event_channel *ec = NULL;

  if (argc != 3)
    die("usage: client <server-address> <server-port>");

  TEST_NZ(getaddrinfo(argv[1], argv[2], NULL, &addr));

  TEST_Z(ec = rdma_create_event_channel());
  TEST_NZ(rdma_create_id(ec, &conn, NULL, RDMA_PS_TCP));
  TEST_NZ(rdma_resolve_addr(conn, NULL, addr->ai_addr, TIMEOUT_IN_MS));

  freeaddrinfo(addr);

  while (rdma_get_cm_event(ec, &event) == 0) {
    struct rdma_cm_event event_copy;

    memcpy(&event_copy, event, sizeof(*event));
    rdma_ack_cm_event(event);

    if (on_event(&event_copy))
      break;
  }

  rdma_destroy_event_channel(ec);

  return 0;
}

Whereas with sockets we’d establish a connection with a simple call to connect(), with rdmacm we have a more elaborate connection process:

  1. Create an ID with rdma_create_id().
  2. Resolve the server’s address by calling rdma_resolve_addr(), passing a pointer to struct sockaddr.
  3. Wait for the RDMA_CM_EVENT_ADDR_RESOLVED event, then call rdma_resolve_route() to resolve a route to the server.
  4. Wait for the RDMA_CM_EVENT_ROUTE_RESOLVED event, then call rdma_connect() to connect to the server.
  5. Wait for RDMA_CM_EVENT_ESTABLISHED, which indicates that the connection has been established.

main() starts this off by calling rdma_resolve_addr(), and the handlers for the subsequent events complete the process:

static int on_addr_resolved(struct rdma_cm_id *id);
static int on_route_resolved(struct rdma_cm_id *id);

int on_event(struct rdma_cm_event *event)
{
  int r = 0;

  if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED)
    r = on_addr_resolved(event->id);
  else if (event->event == RDMA_CM_EVENT_ROUTE_RESOLVED)
    r = on_route_resolved(event->id);
  else if (event->event == RDMA_CM_EVENT_ESTABLISHED)
    r = on_connection(event->id->context);
  else if (event->event == RDMA_CM_EVENT_DISCONNECTED)
    r = on_disconnect(event->id);
  else
    die("on_event: unknown event.");

  return r;
}
In our passive side code, on_connect_request() initialized struct connection and built the verbs context. On the active side, this initialization happens as soon as we have a valid verbs context pointer — in on_addr_resolved():

struct connection {
  struct rdma_cm_id *id;
  struct ibv_qp *qp;

  struct ibv_mr *recv_mr;
  struct ibv_mr *send_mr;

  char *recv_region;
  char *send_region;

  int num_completions;
};

int on_addr_resolved(struct rdma_cm_id *id)
{
  struct ibv_qp_init_attr qp_attr;
  struct connection *conn;

  printf("address resolved.\n");

  build_context(id->verbs); /* unchanged from the server side */
  build_qp_attr(&qp_attr);

  TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));

  id->context = conn = (struct connection *)malloc(sizeof(struct connection));

  conn->id = id;
  conn->qp = id->qp;
  conn->num_completions = 0;

  register_memory(conn);
  post_receives(conn);

  TEST_NZ(rdma_resolve_route(id, TIMEOUT_IN_MS));

  return 0;
}
Note the num_completions field in struct connection: we’ll use it to keep track of the number of completions we’ve processed for this connection. The client will disconnect after processing two completions: one send, and one receive. The next event we expect is RDMA_CM_EVENT_ROUTE_RESOLVED, where we call rdma_connect():

int on_route_resolved(struct rdma_cm_id *id)
{
  struct rdma_conn_param cm_params;

  printf("route resolved.\n");

  memset(&cm_params, 0, sizeof(cm_params));
  TEST_NZ(rdma_connect(id, &cm_params));

  return 0;
}
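We get away with an all-zero cm_params because the defaults are fine for two-sided sends and receives. Once RDMA reads enter the picture (as they will in the next posts), a few fields of struct rdma_conn_param start to matter. A sketch of a typical non-default setup — the values shown are illustrative, not prescriptive:

```c
struct rdma_conn_param cm_params;

memset(&cm_params, 0, sizeof(cm_params));
cm_params.initiator_depth = 1;     /* max outstanding RDMA reads we initiate */
cm_params.responder_resources = 1; /* max outstanding RDMA reads we serve */
cm_params.retry_count = 7;         /* transport retries; 3-bit field, 7 is max */
cm_params.rnr_retry_count = 7;     /* 7 means retry indefinitely on RNR NAK */
```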

Our RDMA_CM_EVENT_ESTABLISHED handler also differs in that we’re sending a different message:

int on_connection(void *context)
{
  struct connection *conn = (struct connection *)context;
  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  snprintf(conn->send_region, BUFFER_SIZE, "message from active/client side with pid %d", getpid());

  printf("connected. posting send...\n");

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)conn;
  wr.opcode = IBV_WR_SEND;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;

  sge.addr = (uintptr_t)conn->send_region;
  sge.length = BUFFER_SIZE;
  sge.lkey = conn->send_mr->lkey;

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

  return 0;
}

Perhaps most importantly, our completion callback now counts the number of completions and disconnects after two are processed:

void on_completion(struct ibv_wc *wc)
{
  struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;

  if (wc->status != IBV_WC_SUCCESS)
    die("on_completion: status is not IBV_WC_SUCCESS.");

  if (wc->opcode & IBV_WC_RECV)
    printf("received message: %s\n", conn->recv_region);
  else if (wc->opcode == IBV_WC_SEND)
    printf("send completed successfully.\n");
  else
    die("on_completion: completion isn't a send or a receive.");

  /* disconnect once both the send and the receive have completed */
  if (++conn->num_completions == 2)
    rdma_disconnect(conn->id);
}
Lastly, our RDMA_CM_EVENT_DISCONNECTED handler is modified to signal to the event loop in main() that it should exit:

int on_disconnect(struct rdma_cm_id *id)
{
  struct connection *conn = (struct connection *)id->context;

  printf("disconnected.\n");

  rdma_destroy_qp(id);

  ibv_dereg_mr(conn->send_mr);
  ibv_dereg_mr(conn->recv_mr);

  free(conn->send_region);
  free(conn->recv_region);
  free(conn);

  rdma_destroy_id(id);

  return 1; /* exit event loop */
}

And that’s it. Once again, the source code for both the client and the server is available here. If you’ve managed to build everything properly, your output should look like the following:

On the server side:

$ /sbin/ifconfig ib0 | grep "inet addr"
          inet addr:  Bcast:  Mask:
$ ./server
listening on port 45267.
received connection request.
connected. posting send...
received message: message from active/client side with pid 29717
send completed successfully.
peer disconnected.

And on the client side:

$ ./client 45267
address resolved.
route resolved.
connected. posting send...
send completed successfully.
received message: message from passive/server side with pid 14943

The IP address passed to client is the IP address of the IPoIB interface on the server. As far as I can tell it’s an rdmacm requirement that the struct sockaddr passed to rdma_resolve_addr() point to an IPoIB interface.

So we now have a working pair of applications. The next post in this series will look at reading and writing directly from/to remote memory.

Updated, Oct. 4: Sample code is now at


13 responses

  1. IronTek

    Are these verbs documented somewhere? I’ve looked around the OpenFabrics site as well as the files in the OFED distribution, but I can’t seem to find what I expect: a PDF that documents the verbs, structs, and such…

    December 5, 2011 at 4:51 pm

    • Ed

      Try the Mellanox website… Look for a PDF titled “RDMA Aware Networks Programming User Manual”. It has a section on the IB Verbs API and the RDMA CM API.

      December 8, 2011 at 8:39 am

      • IronTek

        Thanks! Silly me for thinking that such documentation might be available on the OFED site!

        December 19, 2011 at 2:08 pm

  2. Vee

    Mr. Geek, I have the following scenario. I need to create a single ibv_context and protection domain (PD) with multiple queue pairs, each queue pair associated with a different compute node. The point of the single context is that I would register all of my buffers once, and use a dynamic mechanism to share the keys with peers on demand.
    If I follow the coding strategies above, I notice that rdma_resolve_addr() is the call that initializes the context and PD. Extending this framework to multiple nodes puts me in a situation where each connection to a remote node has its own independent context, and hence no sharing of memory registrations! How do I go about approaching this problem?

    My thoughts: can I make the first call a dummy rdma_resolve_addr(), maybe on the local machine’s IPoIB address? That would get me a context. Could I then use the same rdma_cm_id to establish connections to different remote nodes? Will this strategy work, or will subsequent calls to rdma_resolve_addr() overwrite the context and leave me in the old situation?

    If this kind of behaviour cannot be achieved with the rdmacm library, I may have to end up using socket calls and doing all the gluing myself.


    February 2, 2012 at 12:21 am

    • If all of your queue pairs use the same HCA, the id->verbs pointer in on_addr_resolved will be the same every time. This means you only have one ibv_context, and one protection domain. Have a look at the server-side code — it maintains one context for multiple connections.

      February 20, 2012 at 1:16 pm

  3. Vee

    You are correct. The verbs pointer always returns the same address.

    April 11, 2012 at 1:38 pm

  4. Pingback: Basic flow control for RDMA transfers | The Geek in the Corner

  5. stebanoid

    All the code examples I can find use IP addresses and rdma_resolve_addr() to establish connections. But there are no IP addresses in InfiniBand, which indicates that these examples rely on upper-layer protocols such as IPoIB.
    I want to use only native IB connection establishment, via a ServiceID (a well-known one, for example) and GUID.
    Can you tell me where I can find example code that establishes connections using a ServiceID rather than IP?

    December 24, 2012 at 8:53 am

    • I plan on writing a post about this in January, but meanwhile, look into using AF_IB addresses with rdmacm. There’s not much documentation out there so you may need to dig into some OpenFabrics source.

      December 26, 2012 at 10:00 am

      • bsmith

        Did you ever get a chance to write about using AF_IB instead of ipoib functionality? I can’t find any useful documentation or examples for using AF_IB and your message was over 3 years ago….

        May 18, 2016 at 11:21 am

        • bsmith

          Note: I meant that to mean there is still not much documentation and it has been over 3 years. Not that it was your responsibility to write some 🙂

          May 18, 2016 at 11:23 am

  6. Hi geek,

    I am running into problems while scaling up my application because processes/nodes run out of queue pairs. Does rdmacm support XRC at all? There are scattered discussions on the interwebs but not much more.


    April 18, 2013 at 9:09 am

  7. hexiaobai

    I want to ask a question. The server and the client each send a buffer to the other, so why does the client end up sending its buffer first? Why doesn’t the server send first?

    April 14, 2016 at 9:27 am
