Programming odds and ends — InfiniBand, RDMA, and low-latency networking for now.

Basic flow control for RDMA transfers

Commenter Matt recently asked about sending large(r) amounts of data:

… I’m wondering if you would be able to provide some pointers or even examples that send very large amounts of data. e.g. sending files up to or > 2GB. Your examples use 1024 byte buffers. I suspect there is an efficient way of doing this given that there is a 2**31 limit for the message size.

I should point out that I don’t have lots of memory available as it’s used for other things.

There are many ways to do this, but since I’ve already covered using send/receive operations and using RDMA read/write, this would be a good occasion to combine elements of both and talk about how to handle flow control in general. I’ll also talk a bit about the RDMA-write-with-immediate-data (IBV_WR_RDMA_WRITE_WITH_IMM) operation, and I’ll illustrate these methods with a sample that transfers, using RDMA, a file specified on the command line.

As in previous posts, our sample consists of a server and a client. The server waits for connections from the client. The client does essentially two things after connecting to the server: it sends the name of the file it’s transferring, and then sends the contents of the file. We won’t concern ourselves with the nuts and bolts of establishing a connection; that’s been covered in previous posts. Instead, we’ll focus on synchronization and flow control. One structural change from my RDMA read/write post is worth noting: there, the connection management code was separated into client.c and server.c with the completion-processing code in common.c, whereas here I’ve centralized the connection management in common.c and divided the completion processing between client.c and server.c.

Back to our example. There are many ways we could orchestrate the transfer of an entire file from client to server. For instance:

  • Load the entire file into client memory, connect to the server, wait for the server to post a set of receives, then issue a send operation (on the client side) to copy the contents to the server.
  • Load the entire file into client memory, register the memory, pass the region details to the server, let it issue an RDMA read to copy the entire file into its memory, then write the contents to disk.
  • As above, but issue an RDMA write to copy the file contents into server memory, then signal it to write to disk.
  • Open the file on the client, read one chunk, wait for the server to post a receive, then post a send operation on the client side, and loop until the entire file is sent.
  • As above, but use RDMA reads.
  • As above, but use RDMA writes.

Loading the entire file into memory can be impractical for large files, so we’ll skip the first three options. Of the remaining three, I’ll focus on using RDMA writes so that I can illustrate the use of the RDMA-write-with-immediate-data operation, something I’ve been meaning to talk about for a while. This operation is similar to a regular RDMA write except that the initiator can “attach” a 32-bit value to the write operation. Unlike regular RDMA writes, RDMA writes with immediate data require that a receive operation be posted on the target’s receive queue. The 32-bit value will be available when the completion is pulled from the target’s queue.

Update, Dec. 26: Roland D. rather helpfully pointed out that RDMA write with immediate data isn’t supported by iWARP adapters. We could rewrite to use an RDMA write (without immediate data) followed by a send, but this is left as an exercise for the reader.
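If you do need iWARP portability, the shape of that exercise is straightforward: post a plain RDMA write immediately followed by a small send, chained into a single ibv_post_send() call so they execute in order on the same queue pair. The sketch below is not part of the sample code; it borrows the client-side context fields that appear later in this post, and ctx->chunk_len / ctx->chunk_len_mr (a registered 4-byte buffer holding the chunk length) are hypothetical additions that the sample doesn’t have.

static void write_remote_iwarp(struct rdma_cm_id *id, uint32_t len)
{
  struct client_context *ctx = (struct client_context *)id->context;

  struct ibv_send_wr write_wr, send_wr, *bad_wr = NULL;
  struct ibv_sge write_sge, send_sge;

  memset(&write_wr, 0, sizeof(write_wr));
  memset(&send_wr, 0, sizeof(send_wr));

  /* first work request: the bulk data, written into the server's buffer */
  write_wr.wr_id = (uintptr_t)id;
  write_wr.opcode = IBV_WR_RDMA_WRITE;
  write_wr.wr.rdma.remote_addr = ctx->peer_addr;
  write_wr.wr.rdma.rkey = ctx->peer_rkey;
  write_wr.next = &send_wr; /* chain the send behind the write */

  if (len) {
    write_sge.addr = (uintptr_t)ctx->buffer;
    write_sge.length = len;
    write_sge.lkey = ctx->buffer_mr->lkey;

    write_wr.sg_list = &write_sge;
    write_wr.num_sge = 1;
  }

  /* second work request: a small send telling the server how many bytes
     just landed; ctx->chunk_len and ctx->chunk_len_mr are hypothetical */
  *ctx->chunk_len = htonl(len);

  send_wr.wr_id = (uintptr_t)id;
  send_wr.opcode = IBV_WR_SEND;
  send_wr.send_flags = IBV_SEND_SIGNALED;

  send_sge.addr = (uintptr_t)ctx->chunk_len;
  send_sge.length = sizeof(*ctx->chunk_len);
  send_sge.lkey = ctx->chunk_len_mr->lkey;

  send_wr.sg_list = &send_sge;
  send_wr.num_sge = 1;

  TEST_NZ(ibv_post_send(id->qp, &write_wr, &bad_wr));
}

Because work requests on a reliable connection are executed in order, the server’s receive completion for the send implies the preceding write’s data has already been placed, so the flow control described below works the same way; the server would simply post a normal receive for the 4-byte length instead of the zero-length receive used for writes with immediate data.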

Now that we’ve decided we’re going to break up the file into chunks, and write the chunks one at a time into the server’s memory, we need to find a way to ensure we don’t write chunks faster than the server can process them. We’ll do this by having the server send explicit messages to the client when it’s ready to receive data. The client, on the other hand, will use writes with immediate data to signal the server. The sequence looks something like this:

  1. Server starts listening for connections.
  2. Client posts a receive operation for a flow-control message and initiates a connection to the server.
  3. Server posts a receive operation for an RDMA write with immediate data and accepts the connection from the client.
  4. Server sends the client its target memory region details.
  5. Client re-posts a receive operation then responds by writing the name of the file to the server’s memory region. The immediate data field contains the length of the file name.
  6. Server opens a file descriptor, re-posts a receive operation, then responds with a message indicating it is ready to receive data.
  7. Client re-posts a receive operation, reads a chunk from the input file, then writes the chunk to the server’s memory region. The immediate data field contains the size of the chunk in bytes.
  8. Server writes the chunk to disk, re-posts a receive operation, then responds with a message indicating it is ready to receive data.
  9. Repeat steps 7, 8 until there is no data left to send.
  10. Client re-posts a receive operation, then initiates a zero-byte write to the server’s memory. The immediate data field is set to zero.
  11. Server responds with a message indicating it is done.
  12. Client closes the connection.
  13. Server closes the file descriptor.

A diagram may be helpful:
[Figure: File Transfer sequence diagram, showing the message exchange between client and server enumerated above]

Looking at this sequence we see that the server only ever sends small messages to the client and only ever receives RDMA writes from the client. The client only ever executes RDMA writes and only ever receives small messages from the server.
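The flow-control messages themselves are tiny. The definition below is a sketch that matches how the code in this post uses them; see messages.h in the sample code for the authoritative version:

enum message_id
{
  MSG_MR = 1, /* server -> client: my buffer's address and rkey */
  MSG_READY,  /* server -> client: ready for the next chunk */
  MSG_DONE    /* server -> client: final (zero-length) write seen, all done */
};

struct message
{
  int id;

  union
  {
    struct
    {
      uint64_t addr;
      uint32_t rkey;
    } mr;
  } data;
};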

Let’s start by looking at the server. The connection-establishment details are now hidden behind rc_init(), which sets various callback functions, and rc_server_loop(), which runs an event loop:

int main(int argc, char **argv)
{
  rc_init(
    on_pre_conn,
    on_connection,
    on_completion,
    on_disconnect);

  printf("waiting for connections. interrupt (^C) to exit.\n");

  rc_server_loop(DEFAULT_PORT);

  return 0;
}

The callback names are fairly obvious: on_pre_conn() is called when a connection request is received but before it is accepted, on_connection() is called when a connection is established, on_completion() is called when an entry is pulled from the completion queue, and on_disconnect() is called upon disconnection.
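For orientation, here is roughly the interface common.c exposes. This is a sketch inferred from how the functions and callbacks are used in this post rather than copied from common.h:

typedef void (*pre_conn_cb_fn)(struct rdma_cm_id *id);
typedef void (*connect_cb_fn)(struct rdma_cm_id *id);
typedef void (*completion_cb_fn)(struct ibv_wc *wc);
typedef void (*disconnect_cb_fn)(struct rdma_cm_id *id);

void rc_init(pre_conn_cb_fn, connect_cb_fn, completion_cb_fn, disconnect_cb_fn);
void rc_disconnect(struct rdma_cm_id *id);
void rc_die(const char *reason);
struct ibv_pd *rc_get_pd(void); /* protection domain, used when registering memory */

rc_server_loop() and rc_client_loop() run the rdma_cm event loop and dispatch to these callbacks, as shown in the two main() functions.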

In on_pre_conn(), we allocate a structure to contain various connection context fields (a buffer to contain data from the client, a buffer from which to send messages to the client, etc.) and post a receive work request for the client’s RDMA writes:

static void post_receive(struct rdma_cm_id *id)
{
  struct ibv_recv_wr wr, *bad_wr = NULL;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.sg_list = NULL;
  wr.num_sge = 0;

  TEST_NZ(ibv_post_recv(id->qp, &wr, &bad_wr));
}

What’s interesting here is that we’re setting sg_list = NULL and num_sge = 0. Incoming RDMA write requests will specify a target memory address, and since this work request is only ever going to match incoming RDMA writes, we don’t need to use sg_list and num_sge to specify a location in memory for the receive. After the connection is established, on_connection() sends the memory region details to the client:

static void on_connection(struct rdma_cm_id *id)
{
  struct conn_context *ctx = (struct conn_context *)id->context;

  ctx->msg->id = MSG_MR;
  ctx->msg->data.mr.addr = (uintptr_t)ctx->buffer_mr->addr;
  ctx->msg->data.mr.rkey = ctx->buffer_mr->rkey;

  send_message(id);
}
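send_message() isn’t shown here because it’s unremarkable: it posts a plain send work request pointing at the registered message buffer. A minimal sketch, assuming ctx->msg was registered (as ctx->msg_mr) in on_pre_conn():

static void send_message(struct rdma_cm_id *id)
{
  struct conn_context *ctx = (struct conn_context *)id->context;

  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.opcode = IBV_WR_SEND;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.sg_list = &sge;
  wr.num_sge = 1;

  sge.addr = (uintptr_t)ctx->msg;
  sge.length = sizeof(*ctx->msg);
  sge.lkey = ctx->msg_mr->lkey;

  TEST_NZ(ibv_post_send(id->qp, &wr, &bad_wr));
}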

This prompts the client to begin issuing RDMA writes, which trigger the on_completion() callback:

static void on_completion(struct ibv_wc *wc)
{
  struct rdma_cm_id *id = (struct rdma_cm_id *)(uintptr_t)wc->wr_id;
  struct conn_context *ctx = (struct conn_context *)id->context;

  if (wc->opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
    uint32_t size = ntohl(wc->imm_data);

    if (size == 0) {
      ctx->msg->id = MSG_DONE;
      send_message(id);

      // don't need post_receive() since we're done with this connection

    } else if (ctx->file_name[0]) {
      ssize_t ret;

      printf("received %i bytes.\n", size);

      ret = write(ctx->fd, ctx->buffer, size);

      if (ret != size)
        rc_die("write() failed");

      post_receive(id);

      ctx->msg->id = MSG_READY;
      send_message(id);

    } else {
      /* cap the copy at MAX_FILE_NAME and make sure the name is terminated */
      uint32_t name_len = (size > MAX_FILE_NAME) ? MAX_FILE_NAME : size;

      memcpy(ctx->file_name, ctx->buffer, name_len);
      ctx->file_name[name_len - 1] = '\0';

      printf("opening file %s\n", ctx->file_name);

      ctx->fd = open(ctx->file_name, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

      if (ctx->fd == -1)
        rc_die("open() failed");

      post_receive(id);

      ctx->msg->id = MSG_READY;
      send_message(id);
    }
  }
}

We retrieve the immediate data field from the work completion (wc->imm_data) and convert it from network byte order to host byte order. We then test three possible conditions:

  1. If size == 0, the client has finished writing data. We acknowledge this with MSG_DONE.
  2. If the first byte of ctx->file_name is set, we already have the file name and an open file descriptor. We call write() to append the client’s data to our open file, then reply with MSG_READY, indicating we’re ready to accept more data.
  3. Otherwise, we have yet to receive the file name. We copy it from the incoming buffer, open a file descriptor, then reply with MSG_READY to indicate we’re ready to receive data.
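For reference, the per-connection context these callbacks share looks roughly like this; the field names match the ones used above (see the sample code for the authoritative definition):

struct conn_context
{
  char *buffer;                 /* target of the client's RDMA writes */
  struct ibv_mr *buffer_mr;

  struct message *msg;          /* outgoing flow-control message */
  struct ibv_mr *msg_mr;

  int fd;                       /* output file, opened on the first write */
  char file_name[MAX_FILE_NAME];
};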

Upon disconnection, in on_disconnect(), we close the open file descriptor, deregister our memory regions, and free the connection context.
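A minimal sketch of that cleanup (the actual code is in the sample; this is also a natural place to print the “finished transferring” line that shows up in the sample output below):

static void on_disconnect(struct rdma_cm_id *id)
{
  struct conn_context *ctx = (struct conn_context *)id->context;

  close(ctx->fd);

  ibv_dereg_mr(ctx->buffer_mr);
  ibv_dereg_mr(ctx->msg_mr);

  free(ctx->buffer);
  free(ctx->msg);

  printf("finished transferring %s\n", ctx->file_name);

  free(ctx);
}

And that’s it for the server!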

On the client side, main() is a little more complex in that we need to pass the server host name and port into rc_client_loop():

int main(int argc, char **argv)
{
  struct client_context ctx;

  if (argc != 3) {
    fprintf(stderr, "usage: %s <server-address> <file-name>\n", argv[0]);
    return 1;
  }

  ctx.file_name = basename(argv[2]);
  ctx.fd = open(argv[2], O_RDONLY);

  if (ctx.fd == -1) {
    fprintf(stderr, "unable to open input file \"%s\"\n", ctx.file_name);
    return 1;
  }

  rc_init(
    on_pre_conn,
    NULL, // on connect
    on_completion,
    NULL); // on disconnect

  rc_client_loop(argv[1], DEFAULT_PORT, &ctx);

  close(ctx.fd);

  return 0;
}

We don’t provide on-connection or on-disconnect callbacks because these events aren’t especially relevant to the client. The on_pre_conn() callback is fairly similar to the server’s, except that the connection context structure is pre-allocated, and the receive work request we post (in post_receive()) requires a memory region:

static void post_receive(struct rdma_cm_id *id)
{
  struct client_context *ctx = (struct client_context *)id->context;

  struct ibv_recv_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.sg_list = &sge;
  wr.num_sge = 1;

  sge.addr = (uintptr_t)ctx->msg;
  sge.length = sizeof(*ctx->msg);
  sge.lkey = ctx->msg_mr->lkey;

  TEST_NZ(ibv_post_recv(id->qp, &wr, &bad_wr));
}

We point sg_list to a buffer large enough to hold a struct message. The server will use this to pass along flow control messages. Each message will trigger a call to on_completion(), which is where the client does the bulk of its work:

static void on_completion(struct ibv_wc *wc)
{
  struct rdma_cm_id *id = (struct rdma_cm_id *)(uintptr_t)(wc->wr_id);
  struct client_context *ctx = (struct client_context *)id->context;

  if (wc->opcode & IBV_WC_RECV) {
    if (ctx->msg->id == MSG_MR) {
      ctx->peer_addr = ctx->msg->data.mr.addr;
      ctx->peer_rkey = ctx->msg->data.mr.rkey;

      printf("received MR, sending file name\n");
      send_file_name(id);
    } else if (ctx->msg->id == MSG_READY) {
      printf("received READY, sending chunk\n");
      send_next_chunk(id);
    } else if (ctx->msg->id == MSG_DONE) {
      printf("received DONE, disconnecting\n");
      rc_disconnect(id);
      return;
    }

    post_receive(id);
  }
}

This matches the sequence described above. Both send_file_name() and send_next_chunk() ultimately call write_remote():

static void write_remote(struct rdma_cm_id *id, uint32_t len)
{
  struct client_context *ctx = (struct client_context *)id->context;

  struct ibv_send_wr wr, *bad_wr = NULL;
  struct ibv_sge sge;

  memset(&wr, 0, sizeof(wr));

  wr.wr_id = (uintptr_t)id;
  wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.imm_data = htonl(len);
  wr.wr.rdma.remote_addr = ctx->peer_addr;
  wr.wr.rdma.rkey = ctx->peer_rkey;

  if (len) {
    wr.sg_list = &sge;
    wr.num_sge = 1;

    sge.addr = (uintptr_t)ctx->buffer;
    sge.length = len;
    sge.lkey = ctx->buffer_mr->lkey;
  }

  TEST_NZ(ibv_post_send(id->qp, &wr, &bad_wr));
}

This RDMA request differs from those used in earlier posts in two ways: we set opcode to IBV_WR_RDMA_WRITE_WITH_IMM, and we set imm_data to the length of our buffer.
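For completeness, send_file_name() and send_next_chunk() are thin wrappers around write_remote(). Roughly (a sketch; the real versions are in client.c):

static void send_file_name(struct rdma_cm_id *id)
{
  struct client_context *ctx = (struct client_context *)id->context;

  /* copy the name, including its terminating NUL, into the registered buffer */
  strcpy(ctx->buffer, ctx->file_name);

  write_remote(id, strlen(ctx->file_name) + 1);
}

static void send_next_chunk(struct rdma_cm_id *id)
{
  struct client_context *ctx = (struct client_context *)id->context;

  ssize_t size = read(ctx->fd, ctx->buffer, BUFFER_SIZE);

  if (size == -1)
    rc_die("read() failed");

  /* at end-of-file read() returns 0, so this becomes the zero-length write
     that tells the server we're done (step 10 in the sequence above) */
  write_remote(id, size);
}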

That wasn’t too bad, was it? If everything’s working as expected, you should see the following:

ib-host-1$ ./server 
waiting for connections. interrupt (^C) to exit.
opening file test-file
received 10485760 bytes.
received 10485760 bytes.
received 5242880 bytes.
finished transferring test-file
^C

ib-host-1$ md5sum test-file
5815ed31a65c5da9745764c887f5f777  test-file
ib-host-2$ dd if=/dev/urandom of=test-file bs=1048576 count=25
25+0 records in
25+0 records out
26214400 bytes (26 MB) copied, 3.11979 seconds, 8.4 MB/s

ib-host-2$ md5sum test-file
5815ed31a65c5da9745764c887f5f777  test-file

ib-host-2$ ./client ib-host-1 test-file
received MR, sending file name
received READY, sending chunk
received READY, sending chunk
received READY, sending chunk
received READY, sending chunk
received DONE, disconnecting

If instead you see an error during memory registration, such as the following, you may need to increase your locked memory resource limits:

error: ctx->buffer_mr = ibv_reg_mr(rc_get_pd(), ctx->buffer, BUFFER_SIZE, IBV_ACCESS_LOCAL_WRITE) failed (returned zero/null).

The OpenMPI FAQ has a good explanation of how to do this.
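In short, you need to raise the locked-memory (memlock) limit for the user running the programs. One common way, as an example only (adjust the values and scope to your environment, and log in again afterwards), is to add lines like these to /etc/security/limits.conf and then verify with ulimit -l:

# /etc/security/limits.conf
*    soft    memlock    unlimited
*    hard    memlock    unlimited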

Once more, the sample code is available here.

Updated, Dec. 21: Updated post to describe locked memory limit errors, and updated sample code to: check for ibv_reg_mr() errors; use basename() of file path rather than full path; add missing mode parameter to open() call; add missing library reference to Makefile. Thanks Matt.

Updated, Oct. 4: Sample code is now at https://github.com/tarickb/the-geek-in-the-corner/tree/master/03_file-transfer.

34 responses

  1. Matt

    Thanks for responding with a whole article. This is more than I expected.

    A couple of comments. The file transfer diagram is very low resolution and I’m having trouble reading it.

    The server core dumps for me.

    If you happen to run the client/server on the same node I think it might try to overwrite the file. It fails of course and truncates it to zero. I’ll debug the server side for you, but thanks for the article; a quick skim through the process outlined and it makes sense to me.

    December 20, 2012 at 4:04 pm

    • Thanks — I think I fixed the diagram resolution issue. And yes, if you run the client and server out of the same directory (on the same node, or on a network mount) you’ll end up truncating the file and doing weird things. Do let me know if you figure out what’s causing the server to die. Do you have a stack trace?

      December 20, 2012 at 4:14 pm

  2. Matt

    I’m also having to compile this with -libverbs added to the compilation line.

    There might be some problem with my IB setup. I ran the server on my dev machine and it ran without crashing. But the client also crashes if run on the other node. I suspect some setup issue, but hard to say. Your other examples run without any problems and other ib verbs based utilities also run without issue.

    Here is the stack trace for the client:

    Program terminated with signal 11, Segmentation fault.
    #0 0x0000000000401bc2 in write_remote (id=0x23f51e0, len=10) at client.c:43
    43 sge.lkey = ctx->buffer_mr->lkey;
    (gdb) bt
    #0 0x0000000000401bc2 in write_remote (id=0x23f51e0, len=10) at client.c:43
    #1 0x0000000000401d71 in send_file_name (id=0x23f51e0) at client.c:89
    #2 0x0000000000401ecf in on_completion (wc=0x7fbf7d49eea0) at client.c:116
    #3 0x00000000004017ba in poll_cq (ctx=0x0) at common.c:143
    #4 0x00007fbf7dac7e9a in start_thread ()
    from /lib/x86_64-linux-gnu/libpthread.so.0
    #5 0x00007fbf7ddd3dbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
    #6 0x0000000000000000 in ?? ()

    And for the server if run on the node that is having these issues.

    #0 0x0000000000401bc2 in ibv_post_send (qp=0x7fff0536cc00, wr=0x0, bad_wr=0xa)
    at /usr/include/infiniband/verbs.h:1000
    1000 /usr/include/infiniband/verbs.h: No such file or directory.
    (gdb) bt
    #0 0x0000000000401bc2 in ibv_post_send (qp=0x7fff0536cc00, wr=0x0, bad_wr=0xa)
    at /usr/include/infiniband/verbs.h:1000
    #1 0x0000000000401d71 in on_pre_conn (id=0x401d71) at server.c:61
    #2 0x0000000000401ecf in on_completion (wc=0x401ecf) at server.c:89
    #3 0x00000000004017ba in event_loop (ec=0x401ecf, exit_on_disconnect=32703)
    at common.c:114
    #4 0x00007fbf7dac7e9a in start_thread ()
    from /lib/x86_64-linux-gnu/libpthread.so.0
    #5 0x00007fbf7ddd3dbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
    #6 0x0000000000000000 in ?? ()

    December 20, 2012 at 5:15 pm

    • Matt

      Ok I have gotten it to work now.

      The problem is that the default locked memory user limit on Ubuntu Server is 64 (kB).
      You have to edit /etc/security/limits.conf and add an entry to raise the default.
      I’ve increased it to unlimited, and then I got “open() failed” as the error instead.

      This error is because on Linux you need to add a 3rd mode parameter if O_CREAT is specified. i.e. change the open in server.c to

      ctx->fd = open(ctx->file_name, O_CREAT | O_WRONLY | O_TRUNC, S_IRUSR | S_IWUSR);

      December 20, 2012 at 8:09 pm

      • Matt

        A 3rd issue, although this isn’t too important is that the filename is passed through to the server with the full path. If that path doesn’t exist on the server then the file can’t be created. You may want to add a call to basename() in the server to strip off the path.

        December 20, 2012 at 8:20 pm

        • This is what I get for coding while under the influence of severe jet lag. Thanks for the feedback. I’ve updated the sample code with these three fixes.

          December 21, 2012 at 1:46 am

  3. Allan

    Wow, this is just what I was going to work on this week.

    December 22, 2012 at 5:11 pm

  4. Nice write up. A couple of comments:

    – For a real app, I would recommend pipelining multiple operations to take full advantage of the available bandwidth. For example, the server could start out by saying, “here are 8 destination buffers you can use,” and then the client can post 8 sends to start with. That way, you avoid the slack time after each send, where the client is waiting for the go-ahead for the next send from the server.

    – Also, I would recommend not using RDMA write with immediate, since that’s not portable to iWARP systems. And especially in a bulk-transfer application like this, there’s really not much benefit over just posting an RDMA write followed by a send (you can even use one call to ibv_post_send() to post a list of two work requests in one go). But this is a pretty trivial point.

    December 25, 2012 at 10:16 pm

    • Hey Roland — thank you for the comments. I agree entirely that RDMA write with immediate wouldn’t be the best choice for an application like this, but I wanted to write a post about it at some point and figured this would be as good an example as any. I’ve updated the post to mention the iWARP issue. As for pipelining in a real application: I think I’ll work that into a part-2 post.

      December 26, 2012 at 10:20 am

    • Allan

      just simple double buffering worked well for me. You need to adjust msg size and get it where the transit time for the ready and acks doesn’t exceed the transfer time for the RDMA. (we’re dealing with high latency reads/writes over long distances)

      Once again, thanks for the code and writeup, “geek”

      January 2, 2013 at 4:57 pm

  5. Matt

    Hi Geek,

    I’ve adapted your examples and put them to great use. However, I’m having some issues making the server work with multiple simultaneous client connections.

    Do you think you could adapt your code for that scenario?

    February 20, 2013 at 10:02 pm

    • Could you elaborate a little on the issues you’re having with multiple simultaneous client connections?

      February 24, 2013 at 4:35 pm

  6. Omar

    Hi Geek
    I am trying to set up a connection between multiple clients using the connection manager. I give my clients ranks from 0 to N. I call them P0,P1,P2,….Pn. Now processes send connection requests to only those processes with lesser ranks. P0 does not send connection request to any other process, but all processes send connect request to P0. Similarly P1 sends connection request only to Process P0. All processes except P0 send connection request to P1. This way we set up connection between all host/clients. I have several queries regarding this set up

    1) If a server process is not running we get an RDMA_CM_EVENT_REJECTED error. What should we do next? How do we wait until the server process starts up and then retry the connection request?

    2) How to maintain an Index for all hosts. I mean all processes have an array of rdma_cm_id for all other processes in the system. All processes have information for P0 at index 0, P1 at index 1, P2 at index 2 and so on. How to ensure this order?

    3) If i have a unique process ID, is it possible to exchange this information at the time of connection establishment, or should i exchange this information using send/receive once connections are established.

    I hope i am making my point clear. Please suggest me possible ideas to implement the above mentioned system

    Regards

    Omar

    April 10, 2013 at 6:39 am

    • Let me answer your questions in order:

      1. You could just keep retrying the connection until it works. If you’re going to do this, you should probably sleep() a bit between attempts.

      2. I’m not sure I understand this question. Does an array of struct rdma_cm_id pointers not work?

      3. Sure — you can use the private_data and private_data_len members of struct rdma_conn_param to pass data between the active and passive sides during connection establishment.

      April 11, 2013 at 6:16 pm

      • Umer

        Thanks a lot for your response. is it possible for you to share some code snippet as to how the private_data field can be used. Both on the active and passive side.
        Regards

        April 18, 2013 at 9:44 am

  7. Omar

    Hi Geek,
    i want to know how to use 02 scatter and gather elements, if we have 02 buffers to send. The problem is that sometimes it might be only one buffer to send. In that case how to deal with the second sge. To be specific, i have 02 buffers. One is a static buffer for primitive datatypes, int, char, etc. One is a dynamic buffer for object serialization. The problem is that i don’t know at start up which send/receive will only have a static buffer and which one will also have a dynamic buffer. What should i do? I can use 02 send/receive operations. one for static buffer and one for dynamic, but there might be a better way of doing this.

    Regards
    OK

    April 17, 2013 at 2:08 pm

  8. Khan

    Dear Geek
    Let me first build up a scenario before i ask the question:
    I have a linked list that has a work completion structure as one of its elements.

    Every time i post a send / receive request i save the work queue Id wc.wr_id which i cast as done in your code;
    wc.wr_id = (uint64_t)id
    I save this work queue Id in a hashmap.

    Now whenever i get a work completion i save it into the list i mentioned earlier. Now i want to check which work request has completed. I need to somehow compare the work request Id created while posting send/receive with the work completion i get when i poll ibv_poll_cq().
    is it ok to do this? Do you have a better idea. I am extracting one entry at a time from the completion queue.

    April 18, 2013 at 7:37 am

  9. vamsi

    Hi,
    Thank you for your great work in giving a concise document on RDMA. Can you please give me some suggestions on my design:
    Actually I am going to use a shared memory device for RDMA communication. I just want to have the RDMA protocol just above my shared memory device (no network stack). Will it be possible for me to use the existing protocol stack and just use those hooks in my device driver to communicate between two hosts? Could you please clarify where the routing logic, flow control & connection establishment sit in the protocol stack.
    Once again thank you for your great work.
    Regards,
    Vamsi

    July 15, 2013 at 9:42 am

  10. Sandeep

    Hi Geek,
    Appreciate your blog.
    I wanted to know if it is possible for REST-based communication with a web server via HTTP to be enabled/modified so that it uses RDMA.
    Example: file transfer via a REST API over HTTP, powered by RDMA?
    I don’t see any work around HTTP and RDMA. Your thoughts on its possibility?

    thx

    November 25, 2013 at 11:48 am

    • I suppose it’s possible, but I’m not sure I understand why you’d want to do that. What’s the advantage of using HTTP?

      November 26, 2013 at 10:52 pm

      • sandy

        Well, object storage provides REST-based access, so if one wants to leverage RDMA for accessing an object store, one will need RDMA support for HTTP.

        March 17, 2015 at 6:33 am

  11. Martin

    Hello,
    thank you for your helpful tutorials!

    I got a question concerning rdma writes:
    If the client issues a rdma-write to the server and wants to inform the server about the completion of this write you use either a following send or a write with immediate data (which is basically the same, I guess).
    But the write-completion on the client-side actually only signals that the data has left. The client does not know if the data has already arrived on the other side. Additionally, the send-message and the rdma-write data could use different routes through the network to the server. So in theory, the send-message could arrive an arbitrarily long time _before_ the data arrives at the server. Thus, the server gets the signal that an rdma-write has happened before the data is there.
    How do you deal with that? Is a rdma-write with immediate data special in this respect, meaning is it actually guaranteed that the message arrives always after the data has arrived?
    Somehow every code I have seen using RDMA seems to ignore this problem.

    Thanks again!
    Cheers,
    Martin

    February 14, 2014 at 2:26 am

    • The RDMA code you’ve seen ignores this problem because it isn’t really a problem. 🙂 There are guarantees around transaction ordering. Have a look at section 9.5 of the IB specification.

      February 16, 2014 at 7:23 pm

      • Martin

        Thank you for your reply. It is indeed guaranteed in the IB specifications. This seems also to be the case for iWARP-technology, on which I am working. So this really seems to be a non-problem. Thanks again!

        February 19, 2014 at 8:56 am

  12. Very clear, thanks. One question: is it possible for either the client or the server to save one memory copy and use the memory region given by mmap(file) as the RDMA buffer directly? I see something related (IB_UMEM_MEM_MAP) in the mail archives, but it’s not clear if that feature ever made it to a usable state.

    (Sorry to post this question originally to the wrong page)

    February 1, 2015 at 12:11 am

    • I’ve not tried, but this should be doable. It’s a pretty easy experiment. 🙂

      March 4, 2015 at 2:01 pm

  13. n

    Hi,

    I am trying to run the example but get an “unknown event” error with the client. I am guessing there is something that has changed over time. I am running Mellanox OFED 3.2-2.0.0.0 on CentOS 7.2 if that helps (it does the same thing under Ubuntu with 4.0-2.0.0.1).

    June 13, 2017 at 11:01 am

    • isempty

      Hi. thegeekinthecorner.

      Thanks for sharing your code. I have same issue as above message. I got “unknown event” error with the client side. Something has changed over time because my OFED driver is latest version and OS as well. (Ubuntu 20.04).

      Would you share latest version of your code ?

      April 12, 2022 at 10:07 pm

  14. M Muneendra Kumar

    Hi ,
    I am using the above test and I am able to run it successfully.
    One quick question I have about the client application:
    1) How are the work queue entries filled when we call write_remote(), which internally calls ibv_post_send()?

    I observed in the kernel that rxe_req makes use of the WQE to push the data downstream.

    Any help here will help me understand things better.

    July 12, 2017 at 9:08 am

  15. james vaigl

    I know it’s been several years since you posted this, but I only recently came across it. Thanks very much for your informative write-ups and example code.

    In a couple places, you call ibv_post_send() or ibv_post_recv() passing wr and sge that are stack variables. As soon the calling function returns, these variables are out of scope and undefined. Is this OK? The man page for ibv_post_recv(), for example, says the buffers used by a wr can only be safely reused after the request is fully executed and a completion received.

    July 2, 2020 at 9:46 am

    • The requirement is that the buffer pointers in the WRs/SGEs (the pointer we write into sge.addr in the example above) must remain valid and cannot be reused until the operation completes and a completion is pulled off the associated CQ. However, the WRs/SGEs themselves are only used to set up the operation and are not required to outlive the calls to ibv_post_send() or ibv_post_recv(), so that’s why they are stack variables in the example code.

      July 2, 2020 at 12:28 pm

      • james vaigl

        Thanks for the quick clarification. And again, thanks for your postings. They’re very, very helpful.

        July 2, 2020 at 3:48 pm

  16. liiiweiii

    Hello, I have a question. If the file size to be transferred is 1GB and BUFFER_SIZE is 100MB, the memory size registered by the ibv_reg_mr function in the code is BUFFER_SIZE (100MB). Will the same registered memory be used when sending each chunk later?

    March 11, 2024 at 8:53 am
