Programming odds and ends — InfiniBand, RDMA, and low-latency networking for now.

Building an RDMA-capable application with IB verbs, part 1: basics

If you’re looking to build an application that uses InfiniBand natively, now would be a good time to ask yourself whether you wouldn’t be better off using one of InfiniBand’s upper-layer protocols (ULPs), such as IP-over-IB (IPoIB), SDP, or RDS, or, most obviously, MPI. Writing programs using the verbs library (libibverbs, but I’ll refer to it as ibverbs) isn’t hard, but why reinvent the wheel?

My own reasons for choosing ibverbs rather than MPI or any of the available ULPs came down to the performance advantage native verbs hold over IPoIB, and to the fact that my target applications are ill-suited to the MPI message-passing model. MPI-2’s one-sided communication semantics would probably have worked, but for reasons irrelevant to this discussion MPI is/was a non-starter anyway.

Before looking at the details of programming with ibverbs, we should cover some prerequisites. I strongly recommend reading through the InfiniBand Trade Association’s introduction — chapters one and four in particular (only thirteen pages!). I’m also going to assume that you’re comfortable programming in C, and have at least passing familiarity with sockets, MPI, and networking in general.

Our goal is to connect two applications such that they can exchange data. With reliable, connection-oriented sockets (i.e., SOCK_STREAM), this involves setting up a listening socket on the server side and connecting to it from the client side. Once a connection is established, either side can call send() and recv() to transfer data. Not much of this changes with ibverbs, but everything is done much more explicitly. The significant differences are:

  • You’re not limited to send() and recv(). Reading and writing directly from/to remote memory (i.e., RDMA) is enormously useful.
  • Everything is asynchronous. Requests are made and notification is received at some point in the future that they have (or have not) completed.
  • At the application level, nothing is buffered: a receive has to be posted before the matching send arrives, and memory used for a send request cannot be modified until the request has completed.
  • Memory used for send/receive operations has to be registered, which effectively “pins” it so that it can’t be swapped out (a registration sketch follows this list).
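
For a taste of what registration looks like, here’s a minimal sketch, assuming a valid protection domain (which we’ll meet in a moment) and omitting error handling:

    /* Registers a freshly allocated buffer with the given protection
     * domain; mr->lkey and mr->rkey are what work requests will use
     * to refer to it. */
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t size)
    {
        void *buf = malloc(size);
        if (!buf)
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            free(buf);
        return mr;
    }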

So in an InfiniBand world, how do we establish connections between applications? If you’ve read the IBTA’s introduction you’ll know that the key components we need to set up are the queue pair (consisting of a send queue and a receive queue on which we post send and receive operations, respectively) and the completion queue, on which we receive notification that our operations have completed. Each side of a connection will have a send-receive queue pair and a completion queue (but note that the mapping between an individual send or receive queue and completion queues within any given application can be many-to-one). I’m going to focus on the reliable, connected service (similar to TCP) for now. In later posts I’ll explore the datagram service.

Building queue pairs and connecting them to each other, such that operations posted on one side are executed on the other, involves the following steps:

  1. Create a protection domain (which associates queue pairs, completion queues, memory registrations, etc.), a completion queue, and a send-receive queue pair.
  2. Determine the queue pair’s address.
  3. Communicate the address to the other node (through some out-of-band mechanism).
  4. Transition the queue pair to the ready-to-receive (RTR) state and then the ready-to-send (RTS) state.
  5. Post send, receive, etc. operations as appropriate.

Step four in particular isn’t very pleasant, so we’ll use an event-driven connection manager (CM) to connect queue pairs, manage state transitions, and handle errors. We could use the InfiniBand Connection Manager (ib_cm), but the RDMA Connection Manager (available in librdmacm, and also known as the connection manager abstraction) uses a higher-level IP address/port number abstraction that should be familiar to anyone who’s written a sockets program.
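
To give a sense of what the CM is saving us, here’s roughly what the transition to RTR alone looks like with bare verbs. This is only a sketch: the remote QP number, PSN, and LID are assumed to have arrived through the out-of-band exchange in step three, and the MTU and timer values are arbitrary but plausible.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Moves an RC QP (already in INIT) to the ready-to-receive state. */
    static int qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn,
                         uint32_t remote_psn, uint16_t remote_lid)
    {
        struct ibv_qp_attr attr = {
            .qp_state           = IBV_QPS_RTR,
            .path_mtu           = IBV_MTU_1024,
            .dest_qp_num        = remote_qpn,
            .rq_psn             = remote_psn,
            .max_dest_rd_atomic = 1,
            .min_rnr_timer      = 12,
            .ah_attr            = {
                .dlid     = remote_lid,
                .port_num = 1,
            },
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC |
                             IBV_QP_MIN_RNR_TIMER);
    }

A similar call is needed to reach INIT before this one, and another to reach RTS afterward; the CM does all of it for us.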

This gives us two distinct procedures, one for the passive (responder) side of the connection and another for the active (initiator) side; a rough code sketch follows each list:

Passive Side

  1. Create an event channel so that we can receive rdmacm events, such as connection-request and connection-established notifications.
  2. Bind to an address.
  3. Create a listener and return the port/address.
  4. Wait for a connection request.
  5. Create a protection domain, completion queue, and send-receive queue pair.
  6. Accept the connection request.
  7. Wait for the connection to be established.
  8. Post operations as appropriate.
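
In code, the passive side boils down to something like the following. This is a bare-bones sketch: error handling and the protection domain/completion queue/queue pair construction are omitted, and the next post fills in the details.

    #include <stdio.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    int main(void)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *listener = NULL;
        struct rdma_cm_event *event = NULL;
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET; /* 0.0.0.0, ephemeral port */

        rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP);
        rdma_bind_addr(listener, (struct sockaddr *)&addr);
        rdma_listen(listener, 10); /* backlog of 10, an arbitrary value */

        printf("listening on port %d.\n", ntohs(rdma_get_src_port(listener)));

        while (rdma_get_cm_event(ec, &event) == 0) {
            /* On RDMA_CM_EVENT_CONNECT_REQUEST: build the PD, CQ, and QP
             * for event->id, post receives, then rdma_accept().
             * On RDMA_CM_EVENT_ESTABLISHED: start posting operations. */
            rdma_ack_cm_event(event);
        }

        rdma_destroy_id(listener);
        rdma_destroy_event_channel(ec);
        return 0;
    }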

Active Side

  1. Create an event channel so that we can receive rdmacm events, such as address-resolved, route-resolved, and connection-established notifications.
  2. Create a connection identifier.
  3. Resolve the peer’s address, which binds the connection identifier to a local RDMA device.
  4. Create a protection domain, completion queue, and send-receive queue pair.
  5. Resolve the route to the peer.
  6. Connect.
  7. Wait for the connection to be established.
  8. Post operations as appropriate.
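
The active side, again as a sketch; server_ip and port are placeholders for whatever out-of-band configuration you use, and error handling is omitted.

    #include <stdint.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    static const int TIMEOUT_MS = 500;

    int run_client(const char *server_ip, uint16_t port)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *conn = NULL;
        struct rdma_cm_event *event = NULL;
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, server_ip, &addr.sin_addr);

        rdma_create_id(ec, &conn, NULL, RDMA_PS_TCP);
        rdma_resolve_addr(conn, NULL, (struct sockaddr *)&addr, TIMEOUT_MS);

        while (rdma_get_cm_event(ec, &event) == 0) {
            switch (event->event) {
            case RDMA_CM_EVENT_ADDR_RESOLVED:
                /* conn is now bound to a local device; build the PD,
                 * CQ, and QP here, then resolve a route */
                rdma_resolve_route(conn, TIMEOUT_MS);
                break;
            case RDMA_CM_EVENT_ROUTE_RESOLVED: {
                struct rdma_conn_param params;
                memset(&params, 0, sizeof(params));
                rdma_connect(conn, &params);
                break;
            }
            case RDMA_CM_EVENT_ESTABLISHED:
                /* post operations as appropriate */
                break;
            default:
                break;
            }
            rdma_ack_cm_event(event);
        }
        return 0;
    }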

Both sides will share a fair amount of code — steps one, five, seven, and eight on the passive side are roughly equivalent to steps one, four, seven, and eight on the active side. It may or may not be worth pointing out that, as with sockets, once the connection has been established both sides are peers. Making use of the connection requires that we post operations on the queue pair: receive operations are posted (unsurprisingly) on the receive queue, while on the send queue we post send requests, RDMA read/write requests, and atomic operation requests.
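
Posting itself looks roughly like this; a sketch that assumes the buffer was registered as above and that the QP is connected:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uintptr_t)buf, /* echoed back in the completion */
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr = NULL;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }

    int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uintptr_t)buf,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED, /* ask for a completion */
            .sg_list    = &sge,
            .num_sge    = 1,
        };
        struct ibv_send_wr *bad_wr = NULL;

        return ibv_post_send(qp, &wr, &bad_wr);
    }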

The next two posts will describe in detail the construction of two applications: one will act as the passive/server side and the other will act as the active/client side. Once connected, the applications will exchange a simple message and disconnect.

If you haven’t already, download and install the OpenFabrics software stack. You’ll need it to build the sample code provided in the next posts.

Updated, Nov. 6: Fixed link to IBTA introduction.

39 responses

  1. Mark Travis

    Thank you very much, Geek! You are a huge time-saver for me. I have been attempting to create an InfiniBand-based application by reading the sources of the perftest programs that ship with OpenIB, and for theory I’m referring to Intel Press’s “InfiniBand Architecture Development and Deployment.” The perftest programs don’t seem to use the RDMA Connection Manager, so your approach seems much better than the one I have been pursuing, and I expect it will save me a lot of time in creating my application, as well as in troubleshooting and other ongoing maintenance.

    Your tutorial is an enormous improvement over what I’ve been chugging along with, which is essentially man pages, sparsely-documented code, and a decent book.

    I don’t have any other feedback than to express my appreciation for your efforts.

    November 28, 2010 at 10:02 pm

    • I appreciate the feedback! Feel free to post development/troubleshooting questions in the comments — I’m putting together a post on common pitfalls and troubleshooting tips.

      November 29, 2010 at 9:16 pm

  2. Mark Travis

    Thanks for the continued offer of assistance. Fortunately, I’m doing quite well. I’m able to send messages between my nodes! I’ve struggled mainly with coming to terms with how the connection characteristics, such as the context, change continuously during the setup process. But I think I have it under control now. Thanks very much for your tutorial and assistance. It’s really appreciated.

    Mark

    December 6, 2010 at 4:39 am

  3. Steven Haid

    I am working on building an RDMA capable kernel module. I’ve located the source code for the ib_core module, and find similarities between the exports from ib_core and the verbs/rdma APIs that are documented in the RDMA Aware Network Programming Users Manual.

    Can you point me to any sample programs and/or documentation for how to interface directly to the ib_core module? Thank You.

    December 16, 2010 at 10:47 am

    • I haven’t myself tried using IB from a kernel module, but take a look at the IP-over-IB and iSER (iSCSI Extensions for RDMA) modules in drivers/infiniband/ulp/ipoib and drivers/infiniband/ulp/iser, respectively, both in the kernel source tree. You’re right though — the interface is very similar to the user-space verbs API and, aside from memory registration, the concepts should more or less map straight over.

      December 20, 2010 at 7:51 pm

      • Thanks for your help – I’ve made progress using IB from a kernel module. I used rdma_cm to establish the QPs, ib_post_recv to receive messages, and ib_post_send with IB_WR_SEND and IB_WR_RDMA_WRITE, all successfully.

        I am having trouble getting IB_WR_RDMA_READ to work. The RDMA read requests are getting IB_WC_WR_FLUSH_ERROR completion status. The MR is being allocated with IB_ACCESS_LOCAL_WRITE, IB_ACCESS_REMOTE_READ and IB_ACCESS_REMOTE_WRITE.

        I suspect the problem may be related to qp_attr.qp_access_flags. The value of this flag is 2, which is IB_ACCESS_REMOTE_WRITE. The QP is being transitioned through its states by rdma_cm. I have tried, but have not been successful at, modifying the qp_access_flags after the QPs have been initialized by rdma_cm.

        Do you have any suggestions on what I should look at to resolve the problem? Thank you!

        February 24, 2011 at 10:05 am

        • Glad to hear you’re making progress. As for your problems with IB_WR_RDMA_READ:

          1. Ensure that the responder_resources and initiator_depth members of the struct rdma_conn_param you pass to rdma_connect() or rdma_accept() are at least 1. These parameters limit the number of outstanding RDMA read operations accepted from or issued to the peer, respectively (see the sketch after this list).

          2. Are you calling ib_modify_qp() with IB_QP_ACCESS_FLAGS set in the last parameter? Is the call to ib_modify_qp() failing?

          3. IB_WC_WR_FLUSH_ERROR (and its userspace analog IBV_WC_WR_FLUSH_ERROR) is set for all outstanding requests in a queue after it has transitioned to the error state and doesn’t indicate the actual error. You’ll have to inspect all the completions to determine what caused the QP to enter the error state.
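
          For the first point, the user-space version would look something like this sketch (the kernel rdma_cm takes the same struct rdma_conn_param, so it should translate directly):

              #include <string.h>
              #include <rdma/rdma_cma.h>

              /* Allow at least one outstanding RDMA read in each
               * direction; the passive side passes the same params
               * to rdma_accept(). */
              int connect_with_reads(struct rdma_cm_id *id)
              {
                  struct rdma_conn_param param;

                  memset(&param, 0, sizeof(param));
                  param.responder_resources = 1; /* reads accepted from the peer */
                  param.initiator_depth     = 1; /* reads issued to the peer */

                  return rdma_connect(id, &param);
              }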

          Hope this helps!

          February 24, 2011 at 7:42 pm

      • Setting the responder_resources and initiator_depth fixed the problem. Thank you for your help!

        March 2, 2011 at 8:20 am

        • I am working on an RDMA-capable kernel module. Currently, I run user-space code on one machine and kernel-space code on the other. An RDMA read from kernel space completes successfully; however, when I use kernel-space code on both sides I get “mlx5_warn:mlx5_0:dump_cqe:257:(pid 0): dump error cqe” and wc.status comes back as 11. Did you face any similar issue?

          July 28, 2016 at 5:15 am

  4. Andrew

    Thank you for your examples on RDMA read/write, they’re really helpful to me! But could you provide an example of RDMA atomic operations? There’s almost no material on this on the Internet; I’d appreciate it if you can.

    January 17, 2011 at 11:41 pm

    • True, there’s not much out there about atomics with verbs. I’ve not looked at this too closely myself, but I will when I finish up my messaging-protocols post.

      January 20, 2011 at 9:29 pm

    • Atomic operations are used when one wishes to perform a read-modify-write as a single atomic operation.

      There are two types of atomic operations:
      1) compare-and-swap (CMP&SWP)
      2) fetch-and-add (Fetch&Add)

      I hope this helped a little bit…
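
      To make it a bit more concrete, posting a fetch-and-add might look like this sketch (untested; the MR holding the remote counter needs IBV_ACCESS_REMOTE_ATOMIC, and both the local and remote 8-byte buffers must be 8-byte aligned):

          #include <stdint.h>
          #include <infiniband/verbs.h>

          /* Atomically adds "add" to the 64-bit value at remote_addr;
           * the pre-add value is written to the local buffer. */
          int post_fetch_add(struct ibv_qp *qp, uint64_t local_addr,
                             uint32_t lkey, uint64_t remote_addr,
                             uint32_t rkey, uint64_t add)
          {
              struct ibv_sge sge = {
                  .addr   = local_addr,
                  .length = sizeof(uint64_t),
                  .lkey   = lkey,
              };
              struct ibv_send_wr wr = {
                  .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
                  .send_flags = IBV_SEND_SIGNALED,
                  .sg_list    = &sge,
                  .num_sge    = 1,
              };
              struct ibv_send_wr *bad_wr = NULL;

              wr.wr.atomic.remote_addr = remote_addr;
              wr.wr.atomic.compare_add = add;
              wr.wr.atomic.rkey        = rkey;

              return ibv_post_send(qp, &wr, &bad_wr);
          }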

      June 5, 2012 at 12:20 pm

  5. Mark Travis

    Geek! Am I lame, or need there be an ibv_post_recv() on the destination QP for every single ibv_post_send() from the sender? That’s what I’m finding. I know that I need to pre-post a receive before accepting a connection. However it also appears necessary to pre-post a receive before every message sent.

    An ibv_post_send() from the sender which does not follow an ibv_post_recv() on the receiver results in an ibv_wc_status 13: IBV_WC_RNR_RETRY_EXC_ERR
    Subsequent ibv_post_send() calls result in status 5: IBV_WC_WR_FLUSH_ERR

    My max_send_wr is 31, and max_recv_wr is 16.

    This, to me, shouldn’t be right–a slight delay on the receiving side to perform ibv_post_recv() means that the sending side has to, I assume, resubmit the message.

    This is essentially a maximum queue depth of 1. I’m using different wr_id’s, and different sg_list addresses for each ibv_post_send(), so each message is as discrete as I can make them.

    Am I crazy? If not, then what’s the normal practice for flow control? Just hope the receiver doesn’t fall behind, and retry? Should I fiddle with the rnr_retry and retry_cnt parameters?

    Thanks very much for your assistance!
    Mark

    February 1, 2011 at 10:43 pm

  6. Mark Travis

    Hi, Geek. I think I’ve figured out what I’m trying to do. I guess receives can be queued up on the receiver side, so I don’t necessarily have to execute an ibv_post_recv() after every ibv_post_send(). Since I can (by default with my HCAs) queue up 16 recv’s at a time, then I assume that I can tolerate at least a little bit of lag on the receiving side before the sending side gets RNR’s.

    Funny thing, even though I know that each QP includes a receive queue, conceptually it’s very strange to think in terms of a queue full of receives.

    February 2, 2011 at 3:45 am

    • In principle, if node A is sending (with IBV_WR_SEND, IBV_WR_SEND_WITH_IMM, or IBV_WR_RDMA_WRITE_WITH_IMM) to node B, then yes, you have to post one receive on B for every send posted by A. Furthermore, assuming you don’t want to encounter an RNR NAK, that receive on B has to be posted before the send is posted by A. It’s logical enough — with no buffering, the receiving HCA on B cannot store data from A without knowing where in memory to write. You’re right though, there’s nothing stopping you from posting a large number of receives (e.g., 16) on B, and reposting a receive every time you get a completion. In effect, this keeps the receive queue on B full.
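
      A repost-on-completion loop might look like this sketch, with post_recv() being a helper along the lines of the one sketched in the post above:

          #include <stdint.h>
          #include <infiniband/verbs.h>

          int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                        uint32_t len); /* as sketched in the post */

          void drain_cq(struct ibv_cq *cq, struct ibv_qp *qp,
                        struct ibv_mr *mr, uint32_t len)
          {
              struct ibv_wc wc;

              while (ibv_poll_cq(cq, 1, &wc) > 0) {
                  if (wc.status != IBV_WC_SUCCESS)
                      continue; /* inspect wc.status for the real error */

                  if (wc.opcode & IBV_WC_RECV) {
                      void *buf = (void *)(uintptr_t)wc.wr_id;
                      /* process the message in buf, then return the
                       * buffer to the receive queue */
                      post_recv(qp, mr, buf, len);
                  }
              }
          }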

      However, this won’t necessarily stop the sender from overrunning the receiver. You’ve got several ways of dealing with overrun: implement a “credit” scheme whereby the sender gets a certain number of credits that are decremented when it posts a send and incremented when it receives an acknowledgment; rely on RNR completion errors to throttle down your send rate; or, and this is probably the easiest way to go, set rnr_retry_count to 7, which indicates that the HCA should retry infinitely when it receives an RNR NAK. This last option isn’t ideal, though, because you consume bandwidth continually resending the first packet of your message until the receiver has posted a receive (see Section 9.7.5.2.8, RNR NAK, in the IB spec). Implementing a credit scheme would add complexity, but it is the most flexible/efficient approach. Can you elaborate a little on your application’s communication patterns?

      February 4, 2011 at 2:14 pm

  7. Mark Travis

    Geek, thanks for the information, as well as asking me to elaborate.

    Would you please send me your email address (I think you have mine as a result of my posting this message)?

    February 4, 2011 at 6:57 pm

  8. Gustavo

    Hi, I am just starting with InfiniBand. I want to send messages like “Hello World”. Is it possible with this? I don’t want to use MPI, because I need to write a relatively low-level program to control the InfiniBand port. Is that possible with ibverbs? Thanks, Gustavo

    March 2, 2011 at 6:20 am

    • It certainly is possible (this series of posts does precisely that — it sets up a connection and exchanges messages between hosts), but I’m curious: what kind of low-level control do you need that’s preventing you from using MPI?

      March 3, 2011 at 11:06 am

      • Gustavo

        Thanks Geek for your answer! First of all, apologies for my English! I need to write a simple C program to connect two PCs. Why do I need to do this? Because later I will connect one of the PCs to a “particular device” which has an InfiniBand port. At this moment I don’t know which kind of protocol that device uses; the only thing I know is that this “particular device” can send and receive bytes. So the first step I want to accomplish is to connect two PCs with a simple C program, with “device-level code”.
        I’ve been trying to use a file descriptor with ioctl commands, using open(), read() and write(), but my problem is that I don’t know which device file I must use (I am trying the different files I can see in /dev/infiniband/; I am using CentOS 5.5), and I couldn’t reach any positive result with this.
        Maybe I am on the wrong path, and for my purpose I should be doing something else rather than using open(), read() and write(). So now I’m trying ibverbs, and because of that I was lucky to visit your site. Thanks, Gustavo

        March 4, 2011 at 6:17 am

  9. Gustavo

    Geek: I think that for my purpose I will need to use unreliable datagrams. Will it be necessary to modify large parts of the program, or just the rdma_create_id() call, i.e., rdma_create_id(ec, &listener, NULL, RDMA_PS_UDP)? Could you help me with this? Could you tell me where I will need to make the changes?

    March 4, 2011 at 10:50 am

    • You shouldn’t have to change too much of the sample code to get it to work with UD QPs, but I’ve not experimented with this myself. You’ll have to pull the .qkey and .qp_num values from the .param.ud member of the struct rdma_cm_event delivered with RDMA_CM_EVENT_ESTABLISHED, as well as create an address handle using ibv_create_ah() with the .ah_attr member. The AH, qkey, and qp_num must then be passed to ibv_post_send() in the .wr.ud member of the WR you post.
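
      Something like this sketch, which I haven’t tested, should be close:

          #include <stdint.h>
          #include <rdma/rdma_cma.h>

          struct ud_peer {
              struct ibv_ah *ah;
              uint32_t       qp_num;
              uint32_t       qkey;
          };

          /* On RDMA_CM_EVENT_ESTABLISHED, capture the peer's UD
           * addressing information. */
          void on_established(struct rdma_cm_event *event,
                              struct ibv_pd *pd, struct ud_peer *peer)
          {
              struct rdma_ud_param *ud = &event->param.ud;

              peer->ah     = ibv_create_ah(pd, &ud->ah_attr);
              peer->qp_num = ud->qp_num;
              peer->qkey   = ud->qkey;
          }

          /* ...then attach it to each send WR before posting. */
          void set_ud_dest(struct ibv_send_wr *wr, struct ud_peer *peer)
          {
              wr->wr.ud.ah          = peer->ah;
              wr->wr.ud.remote_qpn  = peer->qp_num;
              wr->wr.ud.remote_qkey = peer->qkey;
          }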

      March 5, 2011 at 6:43 pm

  10. Pingback: [Repost] Building an RDMA-capable application with IB verbs, part 2: the server | 大磊的blog

  11. Pingback: [Repost] Building an RDMA-capable application with IB verbs, part 3: the client | 大磊的blog

  12. Pingback: RDMA read and write with IB verbs | The Geek in the Corner

  13. erenon

    The IB introduction document is moved to: https://cw.infinibandta.org/document/dl/7268

    November 6, 2013 at 4:20 am

  14. Sandor

    Hi,
    first of all thanks bunch for all those tutorials you wrote.

    For some reason though, the rdma cm won’t work on my system. All I get are RDMA_CM_EVENT_ADDR_ERROR (address resolution via rdma_resolve_addr failed) events. I’ve tried several things but to no avail.

    I wondered if you’ve ever stumbled upon the same problem?

    cheers
    Sandor

    November 22, 2013 at 5:13 am

    • What’re the things you’ve tried? Can the hosts ping each other on their IPoIB interfaces?

      November 22, 2013 at 11:13 pm

      • Sandor

        Yes, the client and the server can ping each other over the IPoIB interface. I modified the ‘addr’ structure so ‘rdma_bind_addr’ binds the rdma_cm_id explicitly to the IPoIB interface. I also tried changing the ‘rdma_port_space’ argument from the rdma_create_id.

        cheers

        November 26, 2013 at 2:51 am

        • Neither modification should be necessary, unless (maybe?) you have more than one RDMA-capable interface. What’s the value of event->status in on_event() when the client fails?

          November 26, 2013 at 10:49 pm

          • Sandor

            There should be only one RDMA capable interface. The value of event->status is -19. Errno is set to ‘success’.

            cheers

            November 28, 2013 at 5:27 am

  15. Hello,
    On my system there are two RDMA-capable layers: iWARP (MIC) and IB. How do I tell rdma_cm to use IB? The default one is iWARP, and my application is not working with iWARP.

    December 12, 2013 at 2:41 pm

    • I can’t actually verify that this’ll work, but you could try modifying the struct sockaddr_in that’s passed to rdma_bind_addr() in server.c such that addr.sin_addr contains the IP address of your IB adapter rather than 0.0.0.0.
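
      Something along these lines, where 192.168.0.1 stands in for your IB adapter’s IPoIB address:

          #include <string.h>
          #include <netinet/in.h>
          #include <arpa/inet.h>
          #include <rdma/rdma_cma.h>

          void bind_to_ib(struct rdma_cm_id *listener)
          {
              struct sockaddr_in addr;

              memset(&addr, 0, sizeof(addr));
              addr.sin_family = AF_INET;
              inet_pton(AF_INET, "192.168.0.1", &addr.sin_addr); /* placeholder */

              rdma_bind_addr(listener, (struct sockaddr *)&addr);
          }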

      January 21, 2014 at 11:07 pm

  16. Jagadeesh

    Hi,
    I have used user space verbs and kernel space verbs.
    When using kernel-space verbs, it is recommended to call ib_dma_sync_single_for_cpu()/ib_dma_sync_single_for_device() before the CPU or HCA accesses a registered buffer, to prevent CPU cache problems.
    But in user-space verbs there is no such requirement, and no such API is even provided. Can you help me understand how user space is not affected by CPU caching?

    Regards
    Jagadeesh

    December 13, 2013 at 4:10 am

    • Jagadeesh

      I solved this problem; if anyone faces the same, drop me a mail.

      May 28, 2014 at 8:25 am

      • Jagadeesh, did you implement kernel-space RDMA read and write?

        July 28, 2016 at 7:11 am

  17. Very clear, thanks. One question: is it possible for either the client or the server to save one memory copy and use the memory region given by mmap(file) as the RDMA buffer directly? I see something related (IB_UMEM_MEM_MAP) in the mail archives, but it’s not clear if that feature ever made it to a usable state.

    (Sorry to post this question originally on the wrong page)

    February 1, 2015 at 12:09 am

  18. Tao

    Admirable material!

    July 5, 2017 at 3:30 am

  19. Pingback: 利用 ibverbs 實做 RDMA (Implementing RDMA with ibverbs)
