Ah, Proxmox with InfiniBand… this one’s been coming for quite a while. Proxmox is an incredibly useful and flexible virtualization platform, and in my opinion it holds its own even against the “big players” on the scene – e.g. Hyper-V, ESXi, and even XCP-NG – given the correct use case.
That said, one of the things that makes Proxmox unique compared with some of the other solutions mentioned above is that, at its core, Proxmox offers more than just a hypervisor. What I mean by that is that Proxmox has taken the approach of combining a strong hypervisor platform with some really great storage management options. Maybe the best example of this is the inclusion of Ceph – a solid, high-performing distributed storage system that offers a whole lot for those who can afford the infrastructure to support it.
Today, however, I’d like to focus on something a little more within reach for the “average” setup (for those who maybe can’t afford, or simply don’t need, all of the infrastructure recommended for Ceph). Here we’ll look at a three-node setup that I built for a client, which I think offers a good balance between redundancy and initial investment.
So why InfiniBand?
Well… to start with, the cost. 10Gb Ethernet solutions, despite having been around for quite a long while, are still quite expensive. This is especially true when you start looking at buying two or three pairs of everything… and Fibre Channel will set you back even further. Back when I did this build, each dual-port FDR 54Gb/s Mellanox ConnectX-3 card ran only about $60 used on eBay (now about half that price). Throw in a few DAC cables and you have host-to-host connectivity for around $150. Yes, you read that right – no need for a switch even (though if you do have the budget for two IB switches, I would really recommend you go that route).
If not price, what about performance? While 10Gb Ethernet has the bandwidth for most “average” setups (about 1,250MB/s nominal), there’s a really neat technology available with InfiniBand that isn’t as common (though it does exist) for 10Gb Ethernet.
Say hello to Remote Direct Memory Access, or RDMA for short. RDMA is a pretty incredible technology that allows one host to access the memory of another host directly, without using any CPU cycles on that remote host. Without it (in the case of a 10Gb Ethernet network that doesn’t support RDMA, for example), network traffic to a remote host must be processed through the network stack of that host – which means CPU cycles chewed up AND higher latency. Latency is a real killer when it comes to virtualization – maybe I’ll write more on that later… The main point here is that Proxmox with InfiniBand makes a strong case for itself in terms of price-to-performance.
A Basic Design
So here’s the approach I took: three physical servers in total. One (which we’ll call “Store” from here on out) served mainly as the storage component of the setup – configured with a decent number of both SSDs and HDDs in a few ZFS pools, but with a lower-powered CPU (less heat was the main consideration) and basically just enough RAM to feed ZFS and an NFS server, with a bit of wiggle room to spare. The other two servers (hereafter named “Crunch 1” and “Crunch 2”) were set up with two SSDs each in a ZFS mirror for the OS, plus extra RAM and CPU cores for running VM guests.
Note: For this kind of setup, redundancy is king. While one of the nodes power-cycling with a VM guest running on it may not be the end of the world, it can cause some issues, especially if something else is going on that shouldn’t be (like Windows updates that were installed without a reboot of the VM guest… yeah… don’t do that). All the servers I used in this setup had dual PSUs connected to dual UPSes, which were in turn connected to two separate power sources…
Let’s Build “Store”!
Starting right off with “Store” (remember this is the host we’ll be using to manage our storage and serve it to “Crunch 1” and “Crunch 2”), we’ll need to physically install the InfiniBand card and then of course Proxmox itself.
After that’s out of the way, we’ll need to set up some doodads to get our IB adapter to show up as a networking device. The first of these is “OpenSM” – a subnet manager for IB networks. Fire up that terminal and install OpenSM via:
apt-get install opensm
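If you want to sanity-check that the subnet manager actually came up (I’m assuming here that the package registers a service named “opensm”, which has been my experience on Debian-based systems), something along these lines should do it:
# Confirm the OpenSM subnet manager service is running (service name assumed)
systemctl status opensm
# Or check its recent log output
journalctl -u opensm --no-pager | tail -n 20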
Now that we have our IB subnet manager installed, we’ll need to append a few lines to “/etc/modules” to instruct the Proxmox kernel to load the InfiniBand-specific modules at boot time. So, with the editor of your choice, add the following lines, then save and close the file:
mlx4_core
mlx4_ib
ib_umad
ib_uverbs
ib_ipoib
svcrdma
At this point we need to load the modules. “modprobe” is a perfectly valid option here (a quick sketch of that route is below), but I normally just go with a quick reboot. Call it personal preference.
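If you’d rather skip the reboot, loading them by hand is just a matter of calling modprobe for each module listed above – a minimal sketch:
# Load the InfiniBand modules immediately instead of waiting for a reboot
for m in mlx4_core mlx4_ib ib_umad ib_uverbs ib_ipoib svcrdma; do
  modprobe "$m"
done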
After everything comes back up, you should be able to see the IB adapter when running “ifconfig -a” (if that utility isn’t installed, you can install it by running “apt-get install net-tools”). It’s worth noting here that I’ve discovered that on Proxmox 5.4 the InfiniBand adapter interfaces will be named “ibX”, while on Proxmox 6.3 that seems to have changed to “ibXsXdX” – something to keep in mind for upgrades…
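If you want a bit more detail than ifconfig gives you, the “infiniband-diags” package (an optional extra, not something this setup strictly requires) can show port state and link rate:
# Optional: InfiniBand diagnostic tools
apt-get install infiniband-diags
# Show adapter and port status - once both ends are cabled and OpenSM is running,
# you should eventually see the port state go to "Active" at the expected rate
ibstat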
At this point you’ll be able to go into the Proxmox WebGUI and assign an IP to the InfiniBand interfaces just like you would for any regular Ethernet adapter. Next we’ll step away from the IB part and move on to how we’ll serve the storage from this host to the others… We’re well on the way to the goal of Proxmox with shared storage over InfiniBand!
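For reference, the settings the WebGUI writes out land in “/etc/network/interfaces”. A stanza for one of the IB ports might look roughly like the following (the interface name and address here are just the example values used later in this post):
# /etc/network/interfaces - example stanza for one IB port on "Store"
auto ib01s1d0
iface ib01s1d0 inet static
        address 10.0.0.1
        netmask 255.255.255.0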
Setting up NFS Server
While this is normally pretty straightforward, I have noticed some caveats with this particular setup. The first one of note is that when I tried to configure NFSv4 shares instead of NFSv3, the shares would mount correctly, but they would hang almost immediately after a data transfer was initiated. Maybe you’ll have better luck or be able to figure out something I wasn’t able to at the time (if you do, then please share with the rest of the class in the comments section), but in my case I ended up just going with NFSv3. So for this we’ll start with installing the server:
apt-get install nfs-kernel-server
Then we’ll need to set up the “exports” (the terminology used by NFS for file shares). This can be accomplished by editing the “/etc/exports” file and adding each export like so:
/tank 10.0.0.1/24(rw,async,insecure,no_root_squash)
The breakdown for which is as follows:
/tank
The local directory on the server to serve to NFS clients (which in our case are “Crunch 1” and “Crunch 2”).
10.0.0.1/24
The address (and netmask) of the NFS client that will be allowed to connect to this export – i.e. the IP addresses we will give to the IB interfaces on Crunch 1 and Crunch 2.
(rw,async,insecure,no_root_squash)
What, you think I know all of that off the top of my head?! In all seriousness though, all the options here are well documented and you can find what they do with a quick web search… *ahem* Moving on… You’ll want to create an export for each “Crunch” node that we’ll be connecting later on – a sketch of what that might look like follows.
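As a rough sketch, using the example addressing scheme from later in this post (each Crunch node on its own subnet, pointing at a different port on “Store”), the finished file might look something like this:
# /etc/exports - one entry per Crunch node (IPs are the example values from this post)
/tank 10.0.0.2/24(rw,async,insecure,no_root_squash)
/tank 10.0.1.2/24(rw,async,insecure,no_root_squash)
After editing the file, run “exportfs -ra” (or restart the NFS server) so the changes take effect.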
Addressing a Small Quirk with NFS + RDMA
This one actually took me a little while to figure out. Twice, actually, as the first time around my documentation wasn’t up to snuff (clear and up-to-date documentation is your best friend).
What happens is that, by default, the NFS service doesn’t register the port for RDMA with the portmapper for whatever reason… Because of this, if you have everything set up and ready to go but then try to connect over RDMA, the connection will fail. My workaround for this is pretty straightforward.
Create a directory in your filesystem root called “tools”:
mkdir /tools
Next, inside the directory we’ve just made, create a script named “open_rdma_port.sh” with the following contents:
#!/bin/sh
# Add the RDMA Port for Infiniband to Portlist
echo rdma 20049 > /proc/fs/nfsd/portlist
exit 0
After saving and closing the file, make it executable:
chmod +x /tools/open_rdma_port.sh
And finally, create a cron entry to run this script every time the system boots. Said entry should look as such:
@reboot /tools/open_rdma_port.sh
That’s it – at this point the NFS server should accept incoming connections over RDMA after every boot. Of course, you can run the script manually as well for testing purposes.
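An easy way to confirm the workaround took is to peek at the file the script writes to – after a boot (or after running the script manually) you should see an rdma line in there:
# Verify that nfsd is now listening on the RDMA port
cat /proc/fs/nfsd/portlist
# The output should include a line reading: rdma 20049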
On to the Crunchy Bits
Setting up the two remaining servers is comparatively simple. For each, we’ll start again with the physical IB card install and Proxmox install. Next, the same as with “Store”, we’ll install OpenSM:
apt-get install opensm
Similar to “Store”, we’ll also need to add some modules to “/etc/modules” – note that in this case the last module is different. On “Store” it was “svcrdma”, whereas here we’ll want “xprtrdma”. So again, in the text editor of your choice, add the following lines:
mlx4_core
mlx4_ib
ib_umad
ib_uverbs
ib_ipoib
xprtrdma
Save and close the file and reboot the host. After the host comes back up you should be able to see the InfiniBand interfaces in the Proxmox WebGUI. For the interface that you’ll be connecting to “Store” via DAC cable, make sure the IP settings are in the same subnet as the IP you’ve given to that interface on “Store”. E.g.:
Store – ib01s1d0 – 10.0.0.1/24
Store – ib01s1d1 – 10.0.1.1/24
Crunch 1 – ib01s1d0 – 10.0.0.2/24
Crunch 2 – ib01s1d0 – 10.0.1.2/24
At this point, connecting a DAC cable between the “Store” and “Crunch” hosts should allow you to ping the IPs assigned to “Store”.
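Nothing fancy needed here – a plain ping from each Crunch node toward the Store address on its link is enough to prove the IPoIB layer is working (addresses are the example ones from the list above):
# From Crunch 1, check reachability of Store's first IB port
ping -c 3 10.0.0.1
# From Crunch 2, check the second port/subnet
ping -c 3 10.0.1.1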
Setting up the NFS Client
Under normal circumstances, we’d simply configure the NFS client(s) in the Proxmox WebGUI (Storage > Add > NFS), but currently the WebGUI doesn’t offer the option of NFS over RDMA. Instead, the route I opted for is to manually mount the exports on Crunch 1 and Crunch 2 and then add those mounts into Proxmox’s storage control as directories. To get started, create a directory on the Crunch servers that will serve as the mountpoint:
mkdir /mnt/tank
Then we’ll create an entry in fstab so that the share is automatically mounted at boot time. This should look something like:
10.0.0.1:/tank /mnt/tank nfs rw,relatime,rsize=1048576,wsize=1048576,proto=rdma,port=20049,timeo=600,retrans=2,mountvers=3
The breakdown for which is as follows:
10.0.0.1:/tank
The IB IP assigned to “Store” followed by the exported directory.
/mnt/tank
The local directory to mount the remote exported directory to.
rw,relatime,rsize=1048576,wsize=1048576,proto=rdma,port=20049,timeo=600,retrans=2,mountvers=3
A few points to note: we’ve specified that we want the share to run over RDMA with the “proto=rdma” option. The “mountvers=3” option tells the NFS client to ask to connect using NFSv3 (in most cases, the NFS server will offer multiple NFS versions simultaneously).
After saving and closing “fstab”, it should be possible to mount the share from “Store”:
mount /mnt/tank
To check whether it was successfully mounted with the correct options, run the mount command and look for an entry for the NFS share in the output. E.g.:
root@crunch1:~# mount
10.0.0.1:/tank on /mnt/tank type nfs (rw,relatime,vers=3,rsize=524288,wsize=524288,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=10.0.0.2,mountvers=3,mountproto=tcp,local_lock=none,addr=10.0.0.2)
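For what it’s worth, if you’d like to test the options before committing them to fstab, a one-off manual mount with the same settings should behave identically (paths and IP as in the example above):
# One-off test mount over RDMA, equivalent to the fstab entry above
mount -t nfs -o rw,rsize=1048576,wsize=1048576,proto=rdma,port=20049,timeo=600,retrans=2,mountvers=3 10.0.0.1:/tank /mnt/tank
# And to undo it again
umount /mnt/tank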
Adding the New Storage in Proxmox
In most cases I’d recommend adding the local storage on “Store” to Proxmox – in my case this was a few ZFS pools. While not necessary per se, it allows you to view your disk space usage and the health of the pools from the WebGUI. Of course, this local storage won’t be available to the other hosts, so select only the node “Store” here. On the “Crunch” side of things, we’ll add the NFS export mount points as a “Directory”. Having the same name as the storage we added from “Store” could be confusing, so I chose to name these mount points “tank-cn”. Again, as this mountpoint won’t be available to “Store”, we’ll only select the nodes “Crunch 1” and “Crunch 2”.
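If you prefer the command line over the WebGUI for this part, Proxmox’s pvesm tool can do the same thing. A sketch for the directory storage on the Crunch nodes might look like this (the storage name, node names, and content types are assumptions based on my naming above):
# Add the NFS mountpoint as directory storage, restricted to the Crunch nodes (names assumed)
pvesm add dir tank-cn --path /mnt/tank --nodes crunch1,crunch2 --content images,rootdir
# Optionally add --is_mountpoint yes so Proxmox won't write to the path if the NFS mount is missing
# Confirm the new storage shows up and is active
pvesm status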
And That’s It!
You now have a high-bandwidth, shared-storage setup that offers really short VM migration times and failover should one of the “Crunch” hosts go down – and all that for a pretty reasonable price. If you have anything to add, or any questions, please drop them in the comments section below. Until next time!
Mandatory Plug: If this helped you, please leave me a comment. If you really want to show me some love, maybe consider downloading the Brave browser using my affiliate link – it’s a great browser that has (among other features) Ad-Block and Private Browsing over the Tor network built in. Not only that but it doesn’t send every ounce of your business to the Google overlords. 🙂
Sweet! Have you tried it out with Ceph to see what speeds you get?
I haven’t yet, no, but I imagine that Ceph running over an InfiniBand network would perform very well, especially given the advantages of IB’s RDMA over a standard Fibre Channel setup (no CPU cycles burned up by the network stack). Thanks for your comment!
can you show your /etc/exports file entries?
In my case I have two servers (the aforementioned “Crunch” servers) which access the Store server over NFS, which is why there are two entries with different IPs.
In actuality, the config I run on that system is a bit more complicated, as there are multiple pools (SSD-based, HDD-based, etc.) shared over NFS, but this should get you up and running.