Programming odds and ends — InfiniBand, RDMA, and low-latency networking for now.

RDMA read and write with IB verbs

In my last few posts I wrote about building basic verbs applications that exchange data by posting sends and receives. In this post I’ll describe the construction of applications that use remote direct memory access, or RDMA.

Why would we want to use RDMA? Because it can provide lower latency and allow for zero-copy transfers (i.e., place data at the desired target location without buffering). Consider the iSCSI Extensions for RDMA (iSER). The initiator, or client, issues a read request that includes a destination memory address in its local memory. The target, or server, responds by writing the desired data directly into the initiator’s memory at the requested location. No buffering, minimal operating system involvement (since data is copied by the network adapters), and low latency — generally a winning formula.

Using RDMA with verbs is fairly straightforward: first register blocks of memory, then exchange memory descriptors, then post read/write operations. Registration is accomplished with a call to ibv_reg_mr(), which pins the block of memory in place (thus preventing it from being swapped out) and returns a struct ibv_mr * containing a uint32_t key allowing remote access to the registered memory. This key, along with the block’s address, must then be exchanged with peers through some out-of-band mechanism. Peers can then use the key and address in calls to ibv_post_send() to post RDMA read and write requests. Some code might be instructive:

/* PEER 1 */

const size_t SIZE = 1024;

char *buffer = malloc(SIZE);
struct ibv_mr *mr;
uint32_t my_key;
uint64_t my_addr;

mr = ibv_reg_mr(
  pd,
  buffer,
  SIZE,
  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE); /* remote write requires local write */

my_key = mr->rkey;
my_addr = (uintptr_t)mr->addr;

/* exchange my_key and my_addr with peer 2 */

/* PEER 2 */

const size_t SIZE = 1024;

char *buffer = malloc(SIZE);
struct ibv_mr *mr;
struct ibv_sge sge;
struct ibv_send_wr wr, *bad_wr;
uint32_t peer_key;
uint64_t peer_addr;

mr = ibv_reg_mr(
  pd, 
  buffer, 
  SIZE, 
  IBV_ACCESS_LOCAL_WRITE);

/* get peer_key and peer_addr from peer 1 */

strcpy(buffer, "Hello!");

memset(&wr, 0, sizeof(wr));

sge.addr = (uintptr_t)buffer;
sge.length = SIZE;
sge.lkey = mr->lkey;

wr.sg_list = &sge;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;

wr.wr.rdma.remote_addr = peer_addr;
wr.wr.rdma.rkey = peer_key;

ibv_post_send(qp, &wr, &bad_wr);

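The last parameter to ibv_reg_mr() for peer 1 includes IBV_ACCESS_REMOTE_WRITE, which specifies that we want peer 2 to have write access to the block of memory located at buffer. Verbs requires IBV_ACCESS_LOCAL_WRITE to be set whenever IBV_ACCESS_REMOTE_WRITE is. For RDMA reads, the flag to grant is IBV_ACCESS_REMOTE_READ.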

Using this in practice is more complicated. The sample code that accompanies this post connects two hosts, exchanges memory region keys, reads from or writes to remote memory, then disconnects. The sequence is as follows:

  1. Initialize context and register memory regions.
  2. Establish connection.
  3. Use send/receive model described in previous posts to exchange memory region keys between peers.
  4. Post read/write operations.
  5. Disconnect.

Each side of the connection will have two threads: the main thread, which processes connection events, and the thread polling the completion queue. In order to avoid deadlocks and race conditions, we arrange our operations so that only one thread at a time is posting work requests. To elaborate on the sequence above, after establishing the connection the client will:

  1. Send its RDMA memory region key in a MSG_MR message.
  2. Wait for the server’s MSG_MR message containing its RDMA key.
  3. Post an RDMA operation.
  4. Signal to the server that it is ready to disconnect by sending a MSG_DONE message.
  5. Wait for a MSG_DONE message from the server.
  6. Disconnect.

Step one happens in the context of the RDMA connection event handler thread, but steps two through six are in the context of the verbs CQ polling thread. The sequence of operations for the server is similar:

  1. Wait for the client’s MSG_MR message with its RDMA key.
  2. Send its RDMA key in a MSG_MR message.
  3. Post an RDMA operation.
  4. Signal to the client that it is ready to disconnect by sending a MSG_DONE message.
  5. Wait for a MSG_DONE message from the client.
  6. Disconnect.

Here all six steps happen in the context of the verbs CQ polling thread. Waiting for MSG_DONE is necessary; otherwise we might close the connection before the peer’s RDMA operation has completed. Note that we don’t have to wait for the RDMA operation to complete before sending MSG_DONE: the InfiniBand specification requires that requests be initiated in the order in which they’re posted. This means that the peer won’t receive MSG_DONE until the RDMA operation has completed.

The code for this sample merges a lot of the client and server code from the previous set of posts for the sake of brevity (and to illustrate that they’re nearly identical). Both the client (rdma-client) and the server (rdma-server) continue to operate different RDMA connection manager event loops, but they now share common verbs code — polling the CQ, sending messages, posting RDMA operations, etc. We also use the same code for both RDMA read and write operations since they’re very similar. rdma-server and rdma-client take either “read” or “write” as their first command-line argument.

Let’s start from the top of rdma-common.c, which contains verbs code common to both the client and the server. We first define our message structure. We’ll use this to pass RDMA memory region (MR) keys between nodes and to signal that we’re done.

struct message {
  enum {
    MSG_MR,
    MSG_DONE
  } type;

  union {
    struct ibv_mr mr;
  } data;
};
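
The message above ships an entire struct ibv_mr, although the peer only ever uses its addr and rkey fields (and the length, if it wants to bounds-check). As a rough sketch of a leaner alternative, the following packs just those three values into a fixed 20-byte big-endian layout. The names mr_descriptor, pack_descriptor() and unpack_descriptor() are made up for this illustration; they are not part of the sample code or of libibverbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: 8-byte address, 8-byte length, 4-byte rkey,
 * all big-endian, so both peers agree regardless of host byte order. */
#define MR_DESCRIPTOR_WIRE_SIZE 20

struct mr_descriptor {
  uint64_t addr;   /* start of the registered region (mr->addr) */
  uint64_t length; /* size of the region in bytes (mr->length) */
  uint32_t rkey;   /* remote key (mr->rkey) */
};

/* Store the low `n` bytes of `v` at `out`, most significant byte first. */
static void put_be(unsigned char *out, uint64_t v, size_t n)
{
  size_t i;

  assert(n <= 8);
  for (i = 0; i < n; i++)
    out[i] = (unsigned char)(v >> (8 * (n - 1 - i)));
}

/* Read an `n`-byte big-endian integer from `in`. */
static uint64_t get_be(const unsigned char *in, size_t n)
{
  uint64_t v = 0;
  size_t i;

  assert(n <= 8);
  for (i = 0; i < n; i++)
    v = (v << 8) | in[i];
  return v;
}

/* Serialize `d` into `wire`, which must hold MR_DESCRIPTOR_WIRE_SIZE bytes. */
void pack_descriptor(const struct mr_descriptor *d, unsigned char *wire)
{
  put_be(wire, d->addr, 8);
  put_be(wire + 8, d->length, 8);
  put_be(wire + 16, d->rkey, 4);
}

/* Inverse of pack_descriptor(). */
void unpack_descriptor(const unsigned char *wire, struct mr_descriptor *d)
{
  d->addr = get_be(wire, 8);
  d->length = get_be(wire + 8, 8);
  d->rkey = (uint32_t)get_be(wire + 16, 4);
}
```

The sample keeps the simpler approach of copying the whole structure, which works as long as both peers share the same architecture and library layout.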

Our connection structure has been expanded to include memory regions for RDMA operations as well as the peer’s MR structure and two state variables:

struct connection {
  struct rdma_cm_id *id;
  struct ibv_qp *qp;

  int connected;

  struct ibv_mr *recv_mr;
  struct ibv_mr *send_mr;
  struct ibv_mr *rdma_local_mr;
  struct ibv_mr *rdma_remote_mr;

  struct ibv_mr peer_mr;

  struct message *recv_msg;
  struct message *send_msg;

  char *rdma_local_region;
  char *rdma_remote_region;

  enum {
    SS_INIT,
    SS_MR_SENT,
    SS_RDMA_SENT,
    SS_DONE_SENT
  } send_state;

  enum {
    RS_INIT,
    RS_MR_RECV,
    RS_DONE_RECV
  } recv_state;
};

send_state and recv_state are used by the completion handler to properly sequence messages and RDMA operations between peers. This structure is initialized by build_connection():

void build_connection(struct rdma_cm_id *id)
{
  struct connection *conn;
  struct ibv_qp_init_attr qp_attr;

  build_context(id->verbs);
  build_qp_attr(&qp_attr);

  TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr));

  id->context = conn = (struct connection *)malloc(sizeof(struct connection));

  conn->id = id;
  conn->qp = id->qp;

  conn->send_state = SS_INIT;
  conn->recv_state = RS_INIT;

  conn->connected = 0;

  register_memory(conn);
  post_receives(conn);
}

Since we’re using RDMA read operations, we have to set initiator_depth and responder_resources in struct rdma_conn_param. These control the number of simultaneous outstanding RDMA read requests:

void build_params(struct rdma_conn_param *params)
{
  memset(params, 0, sizeof(*params));

  params->initiator_depth = params->responder_resources = 1;
  params->rnr_retry_count = 7; /* infinite retry */
}
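
The values requested here are bounded by what the adapter supports. Below is a minimal sketch of clamping them, assuming the limits have already been queried (in real code they would come from the max_qp_rd_atom and max_qp_init_rd_atom device attributes, the same ones ibv_devinfo -v prints). The two structs and clamp_read_params() are stand-ins invented for this example so that it compiles without the verbs headers; the mapping of each limit to each parameter reflects my reading of the attribute descriptions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the two device limits of interest; hypothetical type. */
struct rd_atom_limits {
  int max_qp_rd_atom;      /* reads/atomics a QP can service as responder */
  int max_qp_init_rd_atom; /* reads/atomics a QP can have in flight as initiator */
};

/* Stand-in for the three struct rdma_conn_param fields set in build_params(). */
struct read_params {
  uint8_t initiator_depth;
  uint8_t responder_resources;
  uint8_t rnr_retry_count;
};

/* Clamp `wanted` into [0, limit], never exceeding what fits in 8 bits. */
static uint8_t clamp_u8(int wanted, int limit)
{
  if (limit > UINT8_MAX)
    limit = UINT8_MAX; /* the connection parameters are 8-bit fields */
  if (wanted < 0)
    wanted = 0;
  return (uint8_t)(wanted < limit ? wanted : limit);
}

void clamp_read_params(struct read_params *p, const struct rd_atom_limits *lim,
                       int wanted_depth, int wanted_resources)
{
  assert(lim->max_qp_rd_atom >= 0 && lim->max_qp_init_rd_atom >= 0);

  p->initiator_depth = clamp_u8(wanted_depth, lim->max_qp_init_rd_atom);
  p->responder_resources = clamp_u8(wanted_resources, lim->max_qp_rd_atom);
  p->rnr_retry_count = 7; /* 3-bit field; 7 means retry indefinitely */
}
```

With a single outstanding RDMA read, as in this sample, the value 1 is well within the limits of any adapter I have used, so build_params() skips the check.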

Setting rnr_retry_count to 7 indicates that we want the adapter to resend indefinitely if the peer responds with a receiver-not-ready (RNR) error. RNRs happen when a send request is posted before a corresponding receive request is posted on the peer. Sends are posted with the send_message() function:

void send_message(struct connection *conn)
{
  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)conn;
  wr.opcode = IBV_WR_SEND;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;

  sge.addr = (uintptr_t)conn->send_msg;
  sge.length = sizeof(struct message);
  sge.lkey = conn->send_mr->lkey;

  while (!conn->connected); /* busy-wait until the connection is marked as established */

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));
}

send_mr() wraps this function and is used by rdma-client to send its MR to the server, which then prompts the server to send its MR in response, thereby kicking off the RDMA operations:

void send_mr(void *context)
{
  struct connection *conn = (struct connection *)context;

  conn->send_msg->type = MSG_MR;
  memcpy(&conn->send_msg->data.mr, conn->rdma_remote_mr, sizeof(struct ibv_mr));

  send_message(conn);
}

The completion handler does the bulk of the work. It maintains send_state and recv_state, replying to messages and posting RDMA operations as appropriate:

void on_completion(struct ibv_wc *wc)
{
  struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;

  if (wc->status != IBV_WC_SUCCESS)
    die("on_completion: status is not IBV_WC_SUCCESS.");

  if (wc->opcode & IBV_WC_RECV) {
    conn->recv_state++;

    if (conn->recv_msg->type == MSG_MR) {
      memcpy(&conn->peer_mr, &conn->recv_msg->data.mr, sizeof(conn->peer_mr));
      post_receives(conn); /* only rearm for MSG_MR */

      if (conn->send_state == SS_INIT) /* received peer's MR before sending ours, so send ours back */
        send_mr(conn);
    }

  } else {
    conn->send_state++;
    printf("send completed successfully.\n");
  }

  if (conn->send_state == SS_MR_SENT && conn->recv_state == RS_MR_RECV) {
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;

    if (s_mode == M_WRITE)
      printf("received MSG_MR. writing message to remote memory...\n");
    else
      printf("received MSG_MR. reading message from remote memory...\n");

    memset(&wr, 0, sizeof(wr));

    wr.wr_id = (uintptr_t)conn;
    wr.opcode = (s_mode == M_WRITE) ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
    wr.wr.rdma.rkey = conn->peer_mr.rkey;

    sge.addr = (uintptr_t)conn->rdma_local_region;
    sge.length = RDMA_BUFFER_SIZE;
    sge.lkey = conn->rdma_local_mr->lkey;

    TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

    conn->send_msg->type = MSG_DONE;
    send_message(conn);

  } else if (conn->send_state == SS_DONE_SENT && conn->recv_state == RS_DONE_RECV) {
    printf("remote buffer: %s\n", get_peer_message_region(conn));
    rdma_disconnect(conn->id);
  }
}

Let’s examine on_completion() in parts. First, the state update:

if (wc->opcode & IBV_WC_RECV) {
  conn->recv_state++;

  if (conn->recv_msg->type == MSG_MR) {
    memcpy(&conn->peer_mr, &conn->recv_msg->data.mr, sizeof(conn->peer_mr));
    post_receives(conn); /* only rearm for MSG_MR */

    if (conn->send_state == SS_INIT) /* received peer's MR before sending ours, so send ours back */
      send_mr(conn);
  }

} else {
  conn->send_state++;
  printf("send completed successfully.\n");
}

If the completed operation is a receive operation (i.e., if wc->opcode has IBV_WC_RECV set), then recv_state is incremented. If the received message is MSG_MR, we copy the received MR into our connection structure’s peer_mr member, and rearm the receive slot. This is necessary to ensure that we receive the MSG_DONE message that follows the completion of the peer’s RDMA operation. If we’ve received the peer’s MR but haven’t sent ours (as is the case for the server), we send our MR back by calling send_mr(). Updating send_state is uncomplicated.

Next we check for two particular combinations of send_state and recv_state:

if (conn->send_state == SS_MR_SENT && conn->recv_state == RS_MR_RECV) {
  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  if (s_mode == M_WRITE)
    printf("received MSG_MR. writing message to remote memory...\n");
  else
    printf("received MSG_MR. reading message from remote memory...\n");

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)conn;
  wr.opcode = (s_mode == M_WRITE) ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
  wr.wr.rdma.rkey = conn->peer_mr.rkey;

  sge.addr = (uintptr_t)conn->rdma_local_region;
  sge.length = RDMA_BUFFER_SIZE;
  sge.lkey = conn->rdma_local_mr->lkey;

  TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr));

  conn->send_msg->type = MSG_DONE;
  send_message(conn);

} else if (conn->send_state == SS_DONE_SENT && conn->recv_state == RS_DONE_RECV) {
  printf("remote buffer: %s\n", get_peer_message_region(conn));
  rdma_disconnect(conn->id);
}

The first of these combinations is when we’ve both sent our MR and received the peer’s MR. This indicates that we’re ready to post an RDMA operation and post MSG_DONE. Posting an RDMA operation means building an RDMA work request. This is similar to a send work request, except that we specify an RDMA opcode and pass the peer’s RDMA address/key:

wr.opcode = (s_mode == M_WRITE) ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;

wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
wr.wr.rdma.rkey = conn->peer_mr.rkey;

Note that we’re not required to use conn->peer_mr.addr for remote_addr — we could, if we wanted to, use any address falling within the bounds of the memory region registered with ibv_reg_mr().
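
To make the "any address within the region" rule concrete, here is a small sketch of the range check, plus a helper that computes the address of one element of a remote array. rdma_range_ok() and remote_element_addr() are hypothetical helpers, not verbs calls; mr_addr and mr_len stand for the addr and length fields of the peer's struct ibv_mr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if [remote_addr, remote_addr + len) lies entirely inside the
 * registered region [mr_addr, mr_addr + mr_len), 0 otherwise. Written so
 * that none of the intermediate computations can overflow. */
int rdma_range_ok(uint64_t mr_addr, uint64_t mr_len,
                  uint64_t remote_addr, uint64_t len)
{
  if (remote_addr < mr_addr)
    return 0;
  if (len > mr_len)
    return 0;
  return remote_addr - mr_addr <= mr_len - len;
}

/* Address of element `index` in a remote array of `elem_size`-byte elements
 * that starts at the beginning of the registered region. */
uint64_t remote_element_addr(uint64_t mr_addr, size_t index, size_t elem_size)
{
  assert(elem_size > 0);
  return mr_addr + (uint64_t)index * elem_size;
}
```

The result of remote_element_addr() would go into wr.wr.rdma.remote_addr, with sge.length set to the element size and the rkey left unchanged.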

The second combination of states is SS_DONE_SENT and RS_DONE_RECV, indicating that we’ve sent MSG_DONE and received MSG_DONE from the peer. This means it is safe to print the message buffer and disconnect:

printf("remote buffer: %s\n", get_peer_message_region(conn));
rdma_disconnect(conn->id);
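
The state handling is easier to follow in isolation. The sketch below is a verbs-free model of the same bookkeeping: step(), struct model and the action enum are invented for this illustration, and instead of posting work requests the function just reports what on_completion() would do next.

```c
#include <assert.h>

enum send_state { SS_INIT, SS_MR_SENT, SS_RDMA_SENT, SS_DONE_SENT };
enum recv_state { RS_INIT, RS_MR_RECV, RS_DONE_RECV };
enum msg_type { MSG_MR, MSG_DONE };

enum action {
  ACT_NONE,           /* nothing to post */
  ACT_SEND_MR,        /* reply with our own MSG_MR (server path) */
  ACT_RDMA_THEN_DONE, /* post the RDMA operation, then MSG_DONE */
  ACT_DISCONNECT      /* both MSG_DONEs accounted for; safe to tear down */
};

struct model {
  enum send_state send_state;
  enum recv_state recv_state;
};

/* Mirrors the bookkeeping in on_completion(): `is_recv` says whether the
 * completion is for a receive, and `type` is the received message type
 * (ignored for send-side completions). */
enum action step(struct model *m, int is_recv, enum msg_type type)
{
  enum action act = ACT_NONE;

  if (is_recv) {
    assert(m->recv_state != RS_DONE_RECV);
    m->recv_state++;
    if (type == MSG_MR && m->send_state == SS_INIT)
      act = ACT_SEND_MR; /* peer's MR arrived before we sent ours */
  } else {
    assert(m->send_state != SS_DONE_SENT);
    m->send_state++;
  }

  if (m->send_state == SS_MR_SENT && m->recv_state == RS_MR_RECV)
    act = ACT_RDMA_THEN_DONE;
  else if (m->send_state == SS_DONE_SENT && m->recv_state == RS_DONE_RECV)
    act = ACT_DISCONNECT;

  return act;
}
```

Feeding it the client's completions in order (send, receive MSG_MR, send, send, receive MSG_DONE) yields the RDMA action on the second step and the disconnect on the last; the server's trace starts with a receive and gets ACT_SEND_MR first.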

And that’s it. If everything’s working properly, you should see the following when using RDMA writes:

$ ./rdma-server write
listening on port 47881.
received connection request.
send completed successfully.
received MSG_MR. writing message to remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from active/client side with pid 20692
peer disconnected.
$ ./rdma-client write 192.168.0.1 47881
address resolved.
route resolved.
send completed successfully.
received MSG_MR. writing message to remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from passive/server side with pid 26515
disconnected.

And when using RDMA reads:

$ ./rdma-server read
listening on port 47882.
received connection request.
send completed successfully.
received MSG_MR. reading message from remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from active/client side with pid 20916
peer disconnected.
$ ./rdma-client read 192.168.0.1 47882
address resolved.
route resolved.
send completed successfully.
received MSG_MR. reading message from remote memory...
send completed successfully.
send completed successfully.
remote buffer: message from passive/server side with pid 26725
disconnected.

Once again, the sample code is available here.

Updated, Oct. 4: Sample code is now at https://github.com/tarickb/the-geek-in-the-corner/tree/master/02_read-write.

71 responses

  1. Brian

    Thanks again for these blog entries – they were very helpful. Using them as a guide I was able to write a server that registers a memory region and sends the key to the client, and a matching client that receives the key and posts an RDMA write. Everything seems to work.

    October 21, 2010 at 8:45 am

  2. Manoj Nambiar

    Thanks. Your post has been helpful.

    Had a question – regarding the ibv_post_send.

    I am guessing that there are 3 ways to post a sequence of n rdma write requests (all of them within the same registered memory region).
    1. Use N work requests in linked list (using next) in 1 ibv_post_send call
    2. Use 1 work request with N entries in sg_list and num_sge = N in 1 ibv_post_send call
    3. Use N work requests with (1 …. N) ibv_post_send calls.

    Question 1: Are all the 3 methods functionally equivalent?
    Question 2: If alternative 2 is valid – then how do the addr parameters in ibv_sge relate to remote_addr in the rdma field of union wr? Does the 2nd sg_list entry start writing into the remote_addr in the peer at the offset where the first sg_list copy completed?
    Question 3: A work request also takes an IBV_SEND_FENCE flag. Does this guarantee that the current work request will complete before the next, assuming out-of-order processing of work requests may take place for reasons of efficiency?

    October 27, 2010 at 1:12 am

    • I’m not terribly familiar with the scatter/gather features but I’ll answer to the best of my knowledge:

      1. Options #1 and #3 are functionally equivalent, but option #2 doesn’t offer the same guarantees in terms of the order in which the requests will execute. More on this in my answer to question 3. Something else to consider with option #2 is that some (most? all?) HCAs limit the number of scatter/gather entries (SGEs) in a work request. In my case, the magic number is 27.

      2. That would be correct. The same works in reverse (i.e., scattering with a receive or RDMA read request).

      3. Sends and RDMA writes are always processed in the order in which they’re posted. Fencing is necessary if you want to impose ordering guarantees on RDMA read or atomic operations. Have a look at table 76 (“Work Request Operation Ordering”) in the InfiniBand spec (release 1.2.1, section 10.8.3.3). Scatter/gather lists aren’t subject to the same ordering rules (see section 10.7.3.2).

      October 28, 2010 at 3:45 pm

      • Manoj Nambiar

        Thank you geek!

        Just the kind of clarifications I wanted ……

        November 8, 2010 at 11:59 pm

      • Mark Travis

        I just found that, using option #1, my application runs up against 1 less than the max_sge limit, which is 32 for my HCAs. Meaning I have a list of WR’s, each with a single SGE, and ibv_post_send() returns -1 and populates *bad_wr with the 32nd entry in my WR list. I can send up to 31 WR’s in the list with no problem, but the 32nd causes ibv_post_send() to err.

        Meanwhile, my maximum number of WR’s per QP (max_qp_wr) is 16351.

        Does this sound reasonable? At the very least, I would expect max_sge to work, but that exceeding max_sge would cause problems.

        It’s also very possible that I’m causing this problem in some other way as I’m just learning how to work with IB.

        December 19, 2010 at 4:07 am

        • As I understand it, max_sge is the maximum number of SGEs supported by the HCA, but not necessarily for any type of QP. Are you setting (and subsequently checking) the values of max_send_wr and max_send_sge in the cap member of struct ibv_qp_init_attr before (and after) calling rdma_create_qp()?

          December 20, 2010 at 8:05 pm

  3. Manoj Nambiar

    Yet another question.

    In your function on_completion there is a line

    wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr;
    wr.wr.rdma.rkey = conn->peer_mr.rkey;

    Now conn->peer_mr.addr is the same address the remote peer passed to ibv_reg_mr.
    Let’s assume that this remote memory region is an array[10] of type X. And since I have registered this entire array with a single ibv_reg_mr, I will have only one rkey.

    So if I want to update, (say) the 3rd element of the array with an RDMA write I am tempted to use

    wr.wr.rdma.remote_addr = (uintptr_t)conn->peer_mr.addr + 2*sizeof(X);
    wr.wr.rdma.rkey = conn->peer_mr.rkey;

    Question is –
    Will this work as expected?
    Or am I supposed to register each element of the array with an ibv_reg_mr each with its own rkey and use those instead?

    The man pages for ibv_post_send and ibv_reg_mr do not help much. Reason why I am unsure is because the man page for ibv_post_send, lists “wr.wr.rdma.remote_addr” as a start address of the remote memory region.

    November 9, 2010 at 11:59 am

    • Manoj Nambiar

      Answering my own question …..

      That will not work.
      In other words wr.wr.rdma.remote_addr & wr.wr.rdma.rkey have to be used in combination.
      Adding an offset to wr.wr.rdma.remote_addr and using the same wr.wr.rdma.rkey will not work.

      I tried this myself and stumbled upon the answer here http://www.mail-archive.com/general@lists.openfabrics.org/msg19081.html

      What I understand is: If you want to do an rdma_write (or read) on a specific element of an array there are 2 ways to go
      1. Register the entire array as a memory region on both ends. Do a local array update. Do an RDMA write of the entire array. One side effect will be that all elements will be updated on the remote host. This may be undesirable depending on the application
      2. Register each element of the array separately. Each element has its own rkey – so no side effects here.

      November 11, 2010 at 7:27 am

      • It’s absolutely possible to do what you described in your first comment — that is, in the struct ibv_send_wr that you pass to ibv_post_send(), you can specify any address in wr.rdma.remote_addr so long as wr.rdma.remote_addr >= mr->addr and wr.rdma.remote_addr + write_len <= mr->addr + mr->length where mr is the memory region registered with ibv_reg_mr() and write_len is the number of bytes in the write request. There’s no need to register each element of the array separately or to write the entire array if you’re only modifying one element.

        November 12, 2010 at 6:44 pm

  4. Manoj Nambiar

    Hi,

    Couple more questions …

    Is there a way to do RDMA writes without using a completion queue? I use an alternative channel to determine if my work requests were correctly executed or not. When I tried to do so I could send 510 (maybe 512) work requests successfully. After that ibv_post_send returns 22. I looked up the error code, which tells me invalid arguments. Can I tune this? Is there another way to clean up the work requests in the system? Please note – I do not get this problem when I poll completion queues.

    Another question – an ibv_post_send with an IBV_WR_SEND needs an ibv_post_recv on the peer. However, should the size of the message in sge.length (assuming 1 work request with 1 sge) match in the ibv_post_send and the peer’s ibv_post_recv? If not, is there a way to get the same functionality? In my app the receiver will not know in advance the size of the message. At least in socket programming the message size in the send() system call did not have to match that of the peer’s recv().

    Thanks.

    January 11, 2011 at 10:47 am

    • You can prevent the generation of completion queue entries (CQEs) for successful sends by creating the queue pair with sq_sig_all set to 0 (which is how my code sets up the queue pair) and by not setting IBV_SEND_SIGNALED in the send_flags member of struct ibv_send_wr. Note, however, that a CQE will still be generated if the operation failed.

      As for receive buffer lengths matching sends: there’s nothing stopping you from posting a receive with a larger buffer than the peer will send. The converse is not true though — if the receive you post is smaller than an incoming send, the operation will fail on both the receiver (with IBV_WC_LOC_LEN_ERR) and on the sender (with IBV_WC_REM_INV_REQ_ERR). If your application won’t know the message size ahead of time, and if your message sizes are large enough, I’d suggest using a two-stage transfer:

      1. Define a struct that contains a message size, an MR key, and an address.
      2. On the receiver, post a receive for a message the size of the struct you created in step 1.
      3. On the sender, register the memory region you want to transfer. Fill the struct you created in step 1 with the message size, the MR key, and the address of the buffer. Post a send for the struct.
      4. On the receiver, when the receive completes, post an RDMA read using the MR key and address in the struct sent by the sender.

      I’m working on a post that describes this process in more detail.

      January 15, 2011 at 3:43 pm

      • Manoj Nambiar

        Thanks Geek,

        I already tried “creating the queue pair with sq_sig_all set to 0 (which is how my code sets up the queue pair) and by not setting IBV_SEND_SIGNALED in the send_flags member of struct ibv_send_wr”. When I do that and do not poll the completion queue, the 512th work request fails – all the time. Repeated retrying of ibv_post_send does not help my case. It just returns 22. Can I recover by doing an ibv_poll_cq after ibv_post_send returns the error? – This is one idea I am getting after reading your reply. The same program works when I enable IBV_SEND_SIGNALED and poll completion queues.

        I understood the answer for the second part of the question. The only part I am not happy about is the need for 2 ibv_post_recv calls. Thanks for the answer anyway.

        January 17, 2011 at 1:00 am

        • Glad to see you got an answer on linux-rdma for your signaled-send problem. Just to clarify the second part of my previous comment: there aren’t two receives posted — there’s the first receive, for the message struct, followed by an RDMA read (which is posted with ibv_post_send()). This’ll be clearer when I eventually get around to writing about it in detail.

          January 20, 2011 at 9:24 pm

  5. Greg Kerr

    Great article/ code. But when I run it, I get an RDMA_CM_EVENT_ADDR_ERROR from rdma_resolve_addr. The ip address I’m using seems right. Any suggestions as to what might cause such problems?

    March 25, 2011 at 4:34 pm

    • If you’re using InfiniBand (rather than, say, 10 GigE or iWARP), you need to make sure you’ve got your IP-over-IB interfaces correctly set up (as this is how the RDMA CM maps IP addresses to IB devices). Are you able to ping the remote host on its IPoIB interface (ib0, normally)? Is your subnet manager running?

      March 26, 2011 at 1:15 pm

      • Greg Kerr

        Thanks for your help. It appears, curiously, that the main node has ib0 configured, but the sub-nodes I am running on have only eth0 and lo showing up. However, the ibv_rc_pingpong program does work between the two subnodes. I’ll look into why ib0 is not configured on those nodes.

        By the way, I tried to change the ibv_rc_pingpong program to loop and keep sending messages instead of terminating after sending the first message. Oddly enough, everything works fine on the first send, but the call to ibv_post_recv fails on the 2nd time through the loop. There shouldn’t be a CQ overflow problem. Any tips as to what I might look for? I’m not sure how familiar you are with using ibverbs w/o RDMA.

        I am an undergraduate research assistant working to checkpoint/restart Infiniband programs. I’m actually able to checkpoint/restart the basic ibv_rc_pingpong program right now, but I’ve not yet shown it to anyone outside my lab because it’s much too immature to be of any real use to researchers right now. Hence why I am trying to write this variation of ibv_rc_pingpong; I want more test cases that pass before I publicly claim that indeed I can do very primitive checkpoint/restart of ibverbs programs.

        March 28, 2011 at 10:34 am

        • ibv_rc_pingpong runs when ib0 is down because it uses its own out-of-band mechanism to exchange QP information between peers rather than using the RDMA connection manager. There’s nothing stopping you from using a similar approach, it’s just more work for more or less the same result.

          As for terminating after sending the first message — the quick test I just ran with ibv_rc_pingpong shows that it runs for 1,000 iterations by default. Are you not seeing the same behavior? How, precisely, is your ibv_post_recv() call failing?

          March 30, 2011 at 9:51 pm

  6. Jeff Becker

    Thank you for your informative posts on RDMA programming. I am putting together a tutorial on Infiniband Architecture, and was wondering if I could use some of your high level examples? I’ll be sure to attribute these to you , and put your web site URL on the corresponding slide. Thanks again.

    April 25, 2011 at 12:22 pm

  7. Satish C

    I tried to find forums for infiniband verbs API to post my question (without much success:) ).

    I am trying to modify uc_pingpong.c (ibv_uc_pingpong) to a one sided model in which one side sends a whole bunch of messages and the other side receives. On the server(sender) side , I wait for a send completion and then post the next send.

    My program hangs after the pp_post_send/ ibv_post_sends unless I introduce a delay between the send and the completion queue poll. Any ideas as to what causes this behavior? Am I flooding the send queue? or under unreliable connection is the completion event written way before the last bits of a packet are out of the send buffer?

    Thanks!

    June 14, 2011 at 11:15 am

    • I don’t think the work completion is being posted early. From the spec:

      C9-180: For an HCA requester using Unreliable Connection service, the requester shall consider a message Send (or RDMA WRITE) complete when either of the following conditions occurs: The requester has committed the last byte of the VCRC field of the last packet to the wire (and detected no local errors associated with the message transfer), or the requester has detected a local error associated with the message transfer that causes the requester to terminate sending the request.

      I’m not sure how much you’ve modified uc_pingpong.c (seeing the code would be useful), so I really can’t say much. If you’re flooding the send queue, your call to ibv_post_send() should fail.

      As far as a forum for all things verbs- and RDMA-related, try the linux-rdma mailing list.

      June 28, 2011 at 9:10 pm

  8. aydel

    Thank you for an excellent overview.

    I am trying to register (using ibv_reg_mr) many 1 GB blocks of memory.
    I seem to be running into a problem after about 30 blocks. Can this be changed?
    (I want to register over 160 blocks (160 GB) for transferring a lot of data around).

    THANKS!!!

    June 22, 2011 at 12:55 pm

    • Since calling ibv_reg_mr() causes the memory region you specify to be pinned (i.e., it won’t be swapped out), you will eventually run out of physical memory. Do you have at least 160 GB of physical memory? Are your user limits set appropriately (ulimit -l should be at least 160 GB)?

      June 28, 2011 at 8:46 pm

      • aydel

        ulimit -l -> unlimited.
        I have around 190 GB of physical memory on the main computer, and 90 GB on the compute nodes.

        My original thought was to register most of the memory, and as data comes in from another machine (2GB/sec) every .8 sec, to move it to the appropriate compute node.
        Details of the master node:
        ibv_devinfo -v
        hca_id: mlx4_0
        transport: InfiniBand (0)
        fw_ver: 2.7.5558
        node_guid: 0025:90ff:ff07:4370
        sys_image_guid: 0025:90ff:ff07:4373
        vendor_id: 0x02c9
        vendor_part_id: 26428
        hw_ver: 0xA0
        board_id: SM_1051000009
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffe00
        max_qp: 261056
        max_qp_wr: 16351
        device_cap_flags: 0x007c9c76
        max_sge: 32
        max_sge_rd: 0
        max_cq: 65408
        max_cqe: 4194303
        max_mr: 524272
        max_pd: 32764
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 4176896
        max_qp_init_rd_atom: 128
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 0
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 1
        max_mcast_grp: 8192
        max_mcast_qp_attach: 120
        max_total_mcast_qp_attach: 983040
        max_ah: 0
        max_fmr: 0
        max_srq: 65472
        max_srq_wr: 16383
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 15
        port: 1
        state: PORT_ACTIVE (4)
        max_mtu: 2048 (4)
        active_mtu: 2048 (4)
        sm_lid: 1
        port_lid: 1
        port_lmc: 0x00
        link_layer: IB
        max_msg_sz: 0x40000000
        port_cap_flags: 0x0251086a
        max_vl_num: 8 (4)
        bad_pkey_cntr: 0x0
        qkey_viol_cntr: 0x0
        sm_sl: 0
        pkey_tbl_len: 128
        gid_tbl_len: 128
        subnet_timeout: 18
        init_type_reply: 0
        active_width: 4X (2)
        active_speed: 10.0 Gbps (4)
        phys_state: LINK_UP (5)
        GID[ 0]: fe80:0000:0000:0000:0025:90ff:ff07:4371

        June 29, 2011 at 6:56 am

        • Try adjusting the mlx4_core module’s log_mtts_per_seg parameter. In newer OFED versions you can set it as high as 7, which should allow you to register more memory.

          June 29, 2011 at 12:08 pm

          • aydel

            What’s the easiest way to set this? Through code?

            Thanks!!!

            June 29, 2011 at 12:23 pm

          • You can’t set it programmatically — it has to be passed as a parameter when the mlx4_core module is loaded. Add a line containing options mlx4_core log_mtts_per_seg=7 to /etc/modprobe.conf, then reload your IB modules.
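
To see why this parameter matters: a commonly cited estimate for mlx4 hardware (it appears, for example, in the Open MPI FAQ) is that the amount of registrable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page size. The sketch below only does that arithmetic. The function name is hypothetical, and the values used in the note after it (log_num_mtt = 20, a starting log_mtts_per_seg of 3, 4096-byte pages) are assumptions rather than values read from this system, so check your own module parameters.

```c
#include <stdint.h>

/* Rough upper bound, in bytes, on memory that can be registered:
 * (number of MTT segments) * (MTT entries per segment) * (bytes per page).
 * Both module parameters are base-2 logarithms. Hypothetical helper. */
uint64_t max_registrable_bytes(unsigned log_num_mtt,
                               unsigned log_mtts_per_seg,
                               uint64_t page_size)
{
    uint64_t segments = UINT64_C(1) << log_num_mtt;
    uint64_t entries_per_segment = UINT64_C(1) << log_mtts_per_seg;

    return segments * entries_per_segment * page_size;
}
```

Under those assumed values, log_mtts_per_seg = 3 works out to 32 GB and log_mtts_per_seg = 7 to 512 GB, which would be consistent with registration failing after about 30 one-gigabyte blocks.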

            June 29, 2011 at 12:42 pm

          • And be sure you’re running OFED 1.5.2 or newer.

            June 29, 2011 at 12:49 pm

          • aydel

            Thanks,

            What’s the easiest way to “reload your IB modules”?
            (sudo /etc/init.d/opensmd restart?)

            June 29, 2011 at 2:24 pm

          • sudo /etc/init.d/openibd restart should do it.

            June 29, 2011 at 2:32 pm

  9. aydel

    THANK YOU for all your help.

    I can now register 180+ 1GB blocks.

    June 29, 2011 at 5:30 pm

  10. aydel

    Can I change a registered memory’s ibv_pd structure?
    What I am trying to do is set up a way to have a memory region on a master node associated with multiple compute nodes. I have logic to decide which particular one the data needs to be sent to at a particular time (this location can change to another compute node as other data is processed).

    If I understand correctly, a PD allows memory locations on the master node to be associated with a particular compute node. If I have multiple compute nodes (say 6) that I want associated with a particular memory location on the master node, I will need 6 ibv_reg_mr() calls on the master node (one for each compute node).

    Is this correct?

    (Setting log_mtts_per_seg=7 gives me, I think, 512 MRs, which may not be enough.)

    Thanks!

    July 8, 2011 at 7:20 am

    • From what you’ve described you should be able to use just one struct ibv_pd. If you’re posting RDMA reads/writes on the master node, you could register as few as one memory region (assuming your memory allocation is contiguous), and then post operations using that MR. If however the compute nodes will be posting RDMA reads/writes from/to the master node’s memory, then for the purposes of isolation you’ll probably want one MR per compute node (though this certainly isn’t required).

      July 15, 2011 at 8:51 pm

      • aydel

        When setting up the PD, it needs the address and port of the compute nodes, and the memory regions are associated with that particular PD. Setting log_mtts_per_seg=7 gives me, it turns out, 256 MRs. With 4 compute nodes, that works out to 64 MRs per PD (one PD for each compute node). I would prefer to register 180 GB or so of memory on the master node so I can do an RDMA write to any of the 4 compute nodes. It seems, though, that each PD is associated with a context, which is associated with an IB address and a port, so the best I can do is 64 MRs. Memory is contiguous on both the master and compute nodes, and registered buffer sizes are 1 GB (the max on Mellanox). Is there something I am missing?
        Thanks,

        July 17, 2011 at 8:38 am

        • ibv_alloc_pd() takes as its only argument a pointer to struct ibv_context, which is tied to a specific adapter — not to a specific queue pair or address/port on a peer. You can have just one PD and use it with as many queue pairs/compute nodes as you want.

          July 17, 2011 at 9:00 pm

  11. hk

    Thanks for your work, it’s very useful. I can’t download the sample code, and I need it for a test. Could you send it to me by email as soon as possible? 490830134@qq.com. Thank you again!

    July 12, 2011 at 1:50 am

  12. Cynthia Cool

    Hey, it looks like your sample code tar file is ruined. Can you send your sample to my email: cynthiasupercool@gmail.com? Thanks!

    October 21, 2011 at 5:08 pm

  13. Dissa

    Hi,
    Thank you, this work is fantastic.
    I need your help. I am very new to InfiniBand.
    I want to port this to Win 2008 R2.
    I got hold of an InfiniHost III card and am trying to use Mellanox VAPI.
    But I found no trace of the header files or the VAPI library needed to compile and link this.
    I guess I can run this on the Windows command line, and I intend to use Visual Studio 2010.
    Please guide me. I am trying to make use of Mellanox VAPI.
    I appreciate any help with this.
    Thanks,
    Dissa

    December 14, 2011 at 4:00 pm

  14. The fourth doorman of the apocalypse

    In looking through the code, it seems that the polling thread might not get to polling for completion queue events before we post receives on the main thread (because of scheduling, etc.). One way to prevent this is to use a semaphore or something similar via pthreads so that the main thread does not post any receives until it knows that the polling thread is running.

    Is it possible to lose completion events this way?

    February 9, 2012 at 9:53 am

    • Nah, you wouldn’t lose any completion events. They’ll remain queued until you poll for them.

      February 11, 2012 at 1:21 pm

  15. Richard Sharpe

    While I am sure people will eventually find it, perhaps you should use ibv_wc_status_str rather than just ‘die(“on_completion: status is not IBV_WC_SUCCESS.”)’

    February 10, 2012 at 11:06 am

  16. Satish C

    In the above code (and IB verbs), is it difficult to have a buffer of doubles instead of chars? I am trying to send an array of doubles across.

    Thanks!

    May 29, 2012 at 3:04 pm

    • Satish C

      posted too soon… solved it!

      May 30, 2012 at 2:37 am

  17. George Meng

    Thanks for the article. It’s very helpful!
    In the sample, the send/recv buffers in the connection are registered. Do they need to be pinned so that they won’t be swapped out of memory? When do we need this pinning mechanism?

    August 9, 2012 at 4:37 pm

    • Registering memory is necessary because the HCA doesn’t buffer transfers for you — any operation you perform on the queue pair (send, receive, RDMA operations, etc.) will read from or write to memory directly. The HCA can’t (to my knowledge) cause the kernel to page in memory that has been paged out. You could, if you really wanted to, register memory before a transfer and de-register it when the transfer completes, but this is not a great idea. Registering and de-registering memory is expensive.

      August 9, 2012 at 7:15 pm

  18. You need to mark the connected entry as volatile; otherwise the code will not work when compiled with optimizations enabled.

    August 21, 2012 at 5:00 pm

    • Good point. I should mention though that the sample code here is intended to be illustrative — I wouldn’t recommend using it in a real-world application.

      September 4, 2012 at 7:05 pm

  19. I benchmarked the achievable transfer rate using your example (transferring chunks of 8 MB), and the bandwidth seems to be around 650 MBps, while ib_write_bw claims that it should be around 1500 MBps. Do you have any idea why?

    August 24, 2012 at 4:07 am

    • From your comments on linux-rdma I’m assuming the problem was that your memory wasn’t page-aligned?
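
For anyone wondering what a page-aligned buffer looks like in practice, here is a minimal sketch using posix_memalign() and sysconf(). It assumes a POSIX system, and alloc_page_aligned() is a made-up name, not a function from the sample code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign() under strict -std modes */
#endif

#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate `size` bytes starting on a page boundary.
 * Returns NULL on failure; release the buffer with free(). */
void *alloc_page_aligned(size_t size)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buffer = NULL;

    if (page_size <= 0)
        return NULL;

    if (posix_memalign(&buffer, (size_t)page_size, size) != 0)
        return NULL;

    return buffer;
}
```

The returned pointer can then be passed to ibv_reg_mr() in place of a plain malloc() buffer.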

      September 4, 2012 at 7:06 pm

  20. Justin

    Thanks for your article. It’s very helpful to me.

    I have some questions about how I could change the source code.
    I would like the server to wait forever for messages or data from the client. This means the rdma-client will write data to the server’s memory repeatedly.
    Your code is beautiful, of course, but I’m still not familiar with RDMA. So please give me some hints.

    October 30, 2012 at 9:55 pm

    • Look at where MSG_DONE is sent in on_completion(). You could modify this to instead write another buffer to the peer, in a loop, as many times as you want, only sending MSG_DONE when you’re done.

      November 25, 2012 at 8:46 pm

  21. Daniil Kasyanov

    Thanks for your article.

    I have just started to study InfiniBand and RDMA. In your example you use IPoIB. My question is: is it possible to write a non-IP-based application? If yes, can you give me an example or an idea?

    November 14, 2012 at 5:34 am

    • IPoIB is used by rdmacm to map peer addresses to InfiniBand addresses (GIDs), and isn’t used after the queue pairs are set up. Nevertheless, if you want to avoid it altogether, look into passing an AF_IB address to rdma_resolve_addr().

      November 25, 2012 at 8:34 pm

  22. torr

    Hi, thanks for the great job!!
    I saw a strange effect when testing ibv_rc_pingpong. I’d appreciate any suggestions.
    Suppose we have 2 nodes connected by InfiniBand:

    case  node0                      node1                      time
    a)    1 process (32 MB)          1 process (32 MB)          21 sec
    b)    2 processes (32 MB each)   2 processes (32 MB each)   41 sec
    c)    4 processes (32 MB each)   4 processes (32 MB each)   56 sec
    d)    6 processes (32 MB each)   6 processes (32 MB each)   75 sec
    e)    8 processes (32 MB each)   8 processes (32 MB each)   96 sec
    f)    1 process (8*32 MB)        1 process (8*32 MB)        167 sec

    Cases a) and f) behave as expected: if you send a message 8 times larger, it takes 8 times longer.
    a) and b) are understandable too.
    What I can’t explain is how case a) relates to cases c), d), and e). Is there any feature in the InfiniBand protocol stack connected with multiple QPs from different processes?

    November 19, 2012 at 8:08 am

    • Can you elaborate a little on your configuration? Which HCAs are you using, what link speed are they running, and how are you launching your multiple ibv_rc_pingpong processes?

      In principle it shouldn’t matter how many simultaneous processes (within reason) are using the IB adapter — you should be limited by the IB link (or, maybe, by the PCIe link).

      November 25, 2012 at 8:52 pm

  23. torr

    Sorry, the formatting of the table in my previous post was removed :(

    November 19, 2012 at 8:12 am

  24. Pingback: Infiniband addressing - host names to IB address without IBoIP - feed99

  25. Pingback: Infiniband addressing – host names to IB address without IBoIP

  26. Matt

    Hi geek,

    Thanks for providing these examples; they are useful. I’m wondering if you would be able to provide some pointers or even examples that send very large amounts of data, e.g. files of 2 GB or more. Your examples use 1024-byte buffers. I suspect there is an efficient way of doing this given that there is a 2**31 limit for the message size.

    December 17, 2012 at 7:15 pm

    • Matt

      I should point out that I don’t have lots of memory available as it’s used for other things.

      December 17, 2012 at 7:31 pm

    • I’ve been looking for ideas for my next post, and I think this would be a good one. Check back in a few days.
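
Until then, the basic arithmetic of splitting a large transfer into pieces that each fit under a message-size limit can be sketched as follows. This is only an illustration and not the code from any later post. The names struct chunk, chunk_count(), and chunk_at() are hypothetical, and the chunk size is whatever limit or buffer size you choose.

```c
#include <stdint.h>

/* One piece of a larger transfer. `offset` would be added to the local and
 * remote base addresses; `length` fits the 32-bit length field of struct ibv_sge. */
struct chunk {
    uint64_t offset;
    uint32_t length;
};

/* Number of chunks needed to cover total_len bytes using chunks of at most
 * chunk_size bytes. Returns 0 if chunk_size is 0. */
uint64_t chunk_count(uint64_t total_len, uint32_t chunk_size)
{
    if (chunk_size == 0)
        return 0;

    return total_len / chunk_size + (total_len % chunk_size != 0 ? 1 : 0);
}

/* Describe chunk number idx (0-based). The caller must ensure that
 * chunk_size is nonzero and idx < chunk_count(total_len, chunk_size). */
struct chunk chunk_at(uint64_t total_len, uint32_t chunk_size, uint64_t idx)
{
    struct chunk c;
    uint64_t remaining;

    c.offset = idx * chunk_size;
    remaining = total_len - c.offset;

    /* Every chunk is full-sized except possibly the last one. */
    c.length = remaining < chunk_size ? (uint32_t)remaining : chunk_size;
    return c;
}
```

When memory is tight, a single small registered buffer could be reused for every chunk, provided the previous chunk’s work request has completed before the buffer is overwritten.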

      December 18, 2012 at 1:45 pm

  27. Pingback: Basic flow control for RDMA transfers | The Geek in the Corner

  28. Omar Khan

    I am sending data to a remote node via an RDMA_WRITE operation. How can the sender and receiver know that the transfer is complete and the data buffers can be reused, given that RDMA_WRITE does not notify the receiver of incoming data? I thought control messages, like the ones you use in your post, could work, but for multiple sends/receives that is not possible. What if the sender writes to the remote memory again before the receiver has been able to process the data? I need the sender to wait until the receiver signals it to start sending again. Adding a bit at the end of the data to indicate data reception is also not possible.

    September 20, 2013 at 4:14 am

  29. Great post, but I’m still trying to wrap my head around why we need to “exchange” the remote key/address. I mean, if the client is doing RDMA_READ, only the server needs to send its rkey/address to the client. One-way communication of the rkey/address should be enough, right? I’m probably mistaken because my program is not working :( Maybe a related question is why I need to do ibv_post_send(IBV_WR_SEND) from BOTH sides as the first step, when all I want to do is ibv_post_send(IBV_WR_RDMA_READ) from the client side. Here is the code I’m trying to modify: https://code.google.com/p/cpptruths/source/browse/trunk/c/rdma_rc_example.c

    October 17, 2013 at 6:24 pm

    • In my example, we exchange the keys/addresses because the client and server both RDMA-read (or -write, as the case may be) a buffer from each other’s memory. This is why there’s an IBV_WR_SEND from both sides. If the transfer is only in one direction, then you’re right — you don’t need to exchange keys.

      October 18, 2013 at 12:57 pm

  30. Omar khan

    Dear geek
    I would like to ask how to set up all-to-all communication between a number of processes. What I have done is open a listening rdma_cm_id at each host, wait for incoming connection requests, and create new rdma_cm_ids. This works fine if all processes are on different host machines, but if I start multiple processes on the same machine, I get very slow performance or none at all; the system hangs as if in a deadlock. I had hoped that once I have an rdma_cm_id the processes would communicate without any problem. One thing to note is that I have only set up one communication channel, but it should suffice for many clients (the man pages say this).
    Regards
    Omar

    January 28, 2014 at 9:49 pm

  31. Hello Geek,

    I have a problem getting the atomic operations to work. I can successfully send RDMA READ/WRITE operations using your code, but ibv_post_send() for atomic operations fails with errno set to “Invalid Arguments”.

    Here’s the simplified code, I tried to leave unrelated parts out. But everything is pretty much the same as your code. I’d really appreciate your help.

    ******** client code **********
    void build_qp_attr(struct ibv_qp_init_attr *qp_attr) {
      memset(qp_attr, 0, sizeof(*qp_attr));
      qp_attr->send_cq = s_ctx->cq;
      qp_attr->recv_cq = s_ctx->cq;
      qp_attr->qp_type = IBV_QPT_RC;

      qp_attr->cap.max_send_wr = 10;
      qp_attr->cap.max_recv_wr = 10;
      qp_attr->cap.max_send_sge = 1;
      qp_attr->cap.max_recv_sge = 1;
    }

    void register_memory(struct connection *conn) {
      local_buffer = new long long[1];
      local_mr = ibv_reg_mr(pd, local_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE);
    }

    void on_completion(struct ibv_wc *wc) {
      struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
      // Assume that the client already knows about the remote_mr on the server side (through sending and receiving some messages)

      if (wc->opcode & IBV_WC_RECV) {
        struct ibv_send_wr wr, *bad_wr = NULL;
        struct ibv_sge sge;

        memset(&sge, 0, sizeof(sge));
        sge.addr = (uintptr_t)local_buffer;
        sge.length = sizeof(long long);
        sge.lkey = local_mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id = 0;
        wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;

        wr.wr.atomic.remote_addr = (uintptr_t)remote_mr.addr;
        wr.wr.atomic.rkey = remote_mr.rkey;
        wr.wr.atomic.compare_add = 1ULL;

        if (ibv_post_send(qp, &wr, &bad_wr)) {
          fprintf(stderr, "Error, ibv_post_send() failed\n");
          die();
        }
      }
    }
    ***** End of client code ********

    **** Server code ******
    struct connection {
      struct rdma_cm_id *id;
      struct ibv_qp *qp;
      struct ibv_mr *mr;
      long long *rdma_buffer;
    };

    void build_qp_attr(struct ibv_qp_init_attr *qp_attr) {
      memset(qp_attr, 0, sizeof(*qp_attr));
      qp_attr->send_cq = s_ctx->cq;
      qp_attr->recv_cq = s_ctx->cq;
      qp_attr->qp_type = IBV_QPT_RC;

      qp_attr->cap.max_send_wr = 10;
      qp_attr->cap.max_recv_wr = 10;
      qp_attr->cap.max_send_sge = 1;
      qp_attr->cap.max_recv_sge = 1;
    }

    void register_memory(struct connection *conn) {
      rdma_region = 1ULL;

      rm = ibv_reg_mr(pd, rdma_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);
    }
    ***** End of Server code *******

    December 4, 2014 at 4:21 pm
